Abstract

The major goal of this paper is to study the second order frequentist properties of the marginal posterior distribution of the parametric component in semiparametric Bayesian models, in particular, a second order semiparametric Bernstein-von Mises (BvM) theorem. Our first contribution is to discover an interesting interference phenomenon between Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation of the nuisance function leads to higher frequentist inferential accuracy on the parametric component. As the second contribution, we propose a new class of dependent priors under which Bayesian inference procedures for the parametric component are not only efficient but also adaptive (w.r.t. the smoothness of the nonparametric component) up to the second order frequentist validity. However, commonly used independent priors may even fail to produce a desirable root-n contraction rate for the parametric component in this adaptive case unless some stringent assumption is imposed. Three important classes of semiparametric models are examined, and extensive simulations are also provided.

Key words: Bernstein-von Mises theorem; second order asymptotics; semiparametric model. 


1. Introduction 


A semiparametric model is indexed by a Euclidean parameter of interest $\theta \in \Theta \subset \mathbb{R}^p$ and an infinite-dimensional nuisance function $\eta$ belonging to a Banach space $\mathcal{H}$. For example, in the Cox proportional hazards model, $\theta$ is a vector of regression coefficients corresponding to the log hazard ratio, while $\eta$ is a cumulative hazard function. By introducing a joint prior $\Pi$ on the product space $\Theta \times \mathcal{H}$, we can make Bayesian inferences for the parameter of interest, e.g., construct credible sets, through MCMC sampling from the marginal posterior distribution. The frequentist validity of these Bayesian procedures is known to be supported by the semiparametric Bernstein-von Mises (BvM) theorem (see Shen (2001); Bickel and Kleijn (2012); Castillo (2012b)), which states that the marginal posterior distribution of $\theta$ is asymptotically normal and satisfies frequentist criteria of semiparametric efficiency. More precisely, it is proven to converge (in total variation norm) to a Gaussian limit centered at a semiparametric efficient estimate, with covariance matrix equal to the inverse of the efficient Fisher information:

$$\sup_A \left| \Pi(\theta \in A \mid X_1, \ldots, X_n) - N_p\big(\theta_0 + n^{-1/2}\Delta_n, (n\tilde{I}_{\theta_0,\eta_0})^{-1}\big)(A) \right| \xrightarrow{P_{\theta_0,\eta_0}} 0, \tag{1.1}$$

where $A$ is any measurable subset of $\Theta$ and

$$\Delta_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n \tilde{I}_{\theta_0,\eta_0}^{-1}\, \tilde{\ell}_{\theta_0,\eta_0}(X_i) \rightsquigarrow N_p\big(0, \tilde{I}_{\theta_0,\eta_0}^{-1}\big) \tag{1.2}$$

reflects a random fluctuation in the center of the posterior distribution. Here, $P_{\theta_0,\eta_0}$ denotes the true underlying distribution that generates the data $(X_1, \ldots, X_n)$, where $\theta_0$ and $\eta_0$ are the true parameter values; "$\rightsquigarrow$" and "$\xrightarrow{P}$" denote weak convergence and convergence in probability $P$, respectively. In the expressions displayed above, $\tilde{I}_{\theta,\eta}$ ($\tilde{\ell}_{\theta,\eta}$) represents the efficient Fisher information matrix (efficient score function) evaluated at $(\theta, \eta)$. Please see Section 2.1 for a brief review of semiparametric efficiency theory. We call (1.1) the first order version of the semiparametric BvM theorem. Recent studies of the BvM theorem in the nonparametric context can be found in Shang and Cheng (2014); Castillo and Nickl (2014).

The major goal of this paper is to conduct second order studies of the semiparametric BvM theorem by characterizing the decay rate of the remainder term in (1.1), which we name the second order Bayesian efficiency. This efficiency consideration is crucial for understanding the influence of the nonparametric prior on the semiparametric Bayesian inferential accuracy, and it further provides guidance in choosing an appropriate nonparametric prior (or more generally, a joint prior $\Pi$). We remark that our second order result is radically different from those in Cheng and Kosorok (2008a,b, 2009), where the nuisance function is profiled out, and thus no nonparametric prior needs to be assigned. Therefore, as far as we are aware, our work is the first study on the second order semiparametric BvM theorem in a fully Bayesian framework.

Our main conclusion is that the second order efficiency in (1.1) is of the order $O_{P_{\theta_0,\eta_0}}(\sqrt{n}\rho_n^2)$ (up to a logarithmic term), where $\rho_n$ refers to the posterior contraction rate of the nuisance function throughout the paper. For example, in the partially linear models, the posterior contraction rate achieves $\rho_n = n^{-\alpha/(2\alpha+d)} (\log n)^{\gamma}$ for some $\gamma > 0$, which is known to be (almost) minimax optimal, when an appropriate Gaussian process (GP) prior is assigned to the $d$-dimensional nonparametric function of $\alpha$-smoothness. Our general result implies an interesting interference phenomenon between Bayesian estimation and frequentist inferential accuracy: more accurate Bayesian estimation of the nuisance function leads to higher frequentist inferential accuracy on the parametric part. For example, we show that the credible set for $\theta$ possesses a second order frequentist validity that is determined by $\rho_n$. Please see Section 3.2 for more discussion. Therefore, it is desirable to construct a nonparametric prior under which an optimal contraction rate can be achieved. For example, it is desirable to match the smoothness of the reproducing kernel Hilbert space (RKHS) induced by the assigned GP with that of the true regression function. None of the aforementioned conclusions can be inferred from the first order semiparametric BvM theorem. We also remark that our second order Bayesian efficiency is consistent with that derived in Cheng and Kosorok (2008a,b, 2009) (up to a logarithmic term), where no nonparametric prior is assigned. Note that Cheng and Kosorok (2008a,b, 2009) do not work in a fully Bayesian framework and did not cover the adaptive case considered in this paper. Finally, we point out that the above second order results are derived under only two intuitively appealing conditions: one on the posterior concentration, and the other on the integrated local asymptotic normality (Bickel and Kleijn, 2012). Interestingly, these two conditions (together with a set of sufficient conditions in Section 3.3) are not stronger than those imposed in the literature for the first order result, e.g., Bickel and Kleijn (2012); Castillo (2012b). On the contrary, we even relax a stringent root-n convergence condition in Bickel and Kleijn (2012) to a set of commonly used conditions; see Lemma 3.4.
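To make the order $\sqrt{n}\rho_n^2$ concrete, the following minimal sketch evaluates it under the contraction rate $\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{\gamma}$ quoted above. The function names and the default $\gamma = 1$ are illustrative assumptions, not quantities fixed by the paper. Since $\sqrt{n}\rho_n^2 = n^{(d-2\alpha)/(2(2\alpha+d))}(\log n)^{2\gamma}$, the remainder can only vanish when $\alpha > d/2$, and it decays faster for smoother nuisance functions.

```python
import math


def contraction_rate(n, alpha, d, gamma=1.0):
    """rho_n = n^{-alpha / (2 alpha + d)} * (log n)^gamma (gamma = 1 is an assumed value)."""
    return n ** (-alpha / (2 * alpha + d)) * math.log(n) ** gamma


def second_order_rate(n, alpha, d, gamma=1.0):
    """Order of the second order remainder, sqrt(n) * rho_n^2."""
    return math.sqrt(n) * contraction_rate(n, alpha, d, gamma) ** 2


# Compare several smoothness levels for a one-dimensional nuisance function.
for alpha in (0.5, 1.0, 2.0, 4.0):
    rates = [second_order_rate(n, alpha, d=1) for n in (10**3, 10**5, 10**7)]
    print(alpha, ["%.3g" % r for r in rates])
```

With $d = 1$, the boundary case $\alpha = 0.5$ leaves only the logarithmic factor, so the printed values grow with $n$, whereas for $\alpha > 0.5$ they eventually decrease.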

We further apply our general theory to two classes of priors, which differ in whether $\theta$ and $\eta$ are dependent or not. Surprisingly, we find that the commonly used independent prior is not the best choice for the second order semiparametric BvM theorem, in the sense that it requires a slightly stronger condition (A6) and might even break down for the first order consistency when the smoothness of the nonparametric function is unknown; see Section 4.1 for more explanation. This failure is mainly due to the existence of a semiparametric bias term defined in (2.1); also see Rivoirard and Rousseau (2012). Interestingly, we show that the semiparametric bias can be easily eliminated by shifting the center of a nonparametric prior (by a $\theta$-dependent quantity), which naturally leads to a general class of nonparametric priors. This re-centering idea is rather different from, and perhaps easier to implement than, the prior under-smoothing procedure proposed in Castillo (2012a). Moreover, our dependent priors can be easily made adaptive with respect to the unknown smoothness of the nuisance function by re-centering a nonparametric adaptive prior.

The rest of the paper is organized as follows. Section 2 provides necessary background on semiparametric efficiency theory, and describes several semiparametric models including partially linear models and Cox proportional hazards models. Our main theorem, together with the related Bayesian inference results, is presented in Section 3. The classes of independent and dependent priors are extensively discussed in Section 4. Section 5 illustrates the applicability of our general theory in three examples. All the technical proofs are postponed to the Appendix.


2. Preliminaries 


In this section, we briefly review semiparametric efficiency theory (Bickel et al., 1998), and describe several semiparametric models considered in the paper.


2.1 Review on Semiparametric Efficiency 

An estimator $\hat{\theta}_n$ is semiparametric efficient if it achieves the minimal asymptotic variance $V^*$ over all regular semiparametric estimators $\tilde{\theta}_n$ that satisfy $\sqrt{n}(\tilde{\theta}_n - \theta_0) \rightsquigarrow N_p(0, V)$ for some non-degenerate asymptotic variance $V$. It can be shown that the minimal $V^*$ exists and corresponds to the largest asymptotic variance over all the parametric submodels $\{P_{\theta,\eta(\theta)} : \theta \in \Theta\}$ with $\eta(\theta_0) = \eta_0$ (Bickel et al., 1998). The submodel achieving $V^*$ is called the least favorable submodel, and denoted as $\{P_{\theta,\eta^*(\theta)} : \theta \in \Theta\}$, where $\eta^*(\theta)$ is the so-called least favorable curve. We define the semiparametric bias as

$$\Delta\eta(\theta) = \eta^*(\theta) - \eta^*(\theta_0) = \eta^*(\theta) - \eta_0, \tag{2.1}$$

which will be frequently mentioned hereafter. Let $\tilde{\ell}_{\theta_0,\eta_0}$ be the score function of the least favorable submodel $\{P_{\theta,\eta^*(\theta)} : \theta \in \Theta\}$ at $\theta = \theta_0$. Hence, we have $V^* = \tilde{I}_{\theta_0,\eta_0}^{-1}$, where $\tilde{I}_{\theta_0,\eta_0} = E_0 \tilde{\ell}_{\theta_0,\eta_0} \tilde{\ell}_{\theta_0,\eta_0}^T$. Note that $\tilde{\ell}_{\theta_0,\eta_0}$ and $\tilde{I}_{\theta_0,\eta_0}$ are also known as the efficient score function and the efficient information matrix in the semiparametric literature. For simplicity, we denote $P_{\theta_0,\eta_0}$, $\tilde{\ell}_{\theta_0,\eta_0}$ and $\tilde{I}_{\theta_0,\eta_0}$ as $P_0$, $\tilde{\ell}_0$ and $\tilde{I}_0$ from now on.

Severini and Wong (1992) discovered that $\eta^*(\theta)$ is essentially the unique minimizer of the Kullback-Leibler (KL) divergence in $\mathcal{H}$ with the parametric part $\theta$ being fixed, i.e.,

$$\eta^*(\theta) = \arg\inf_{\eta \in \mathcal{H}} K(P_0, P_{\theta,\eta}), \tag{2.2}$$

where $K(P, Q) = \int \log(dP/dQ)\, dP$ denotes the KL divergence between two measures $P$ and $Q$. In the Bayesian regime, the least favorable curve $\eta^*(\theta)$ can be understood as the function towards which the conditional posterior distribution of the nuisance parameter $\eta$ given $\theta$ contracts (Kleijn and van der Vaart, 2006). Therefore, the posterior distribution of $(\theta, \eta)$ tends to concentrate around the true value $(\theta_0, \eta_0)$ under a well chosen prior $\Pi$. We use the following examples to illustrate the above concepts, in particular $\eta^*(\theta)$.


2.2 Generalized Partially Linear Model (GPLM) 

Suppose that the data $X_i = (U_i, V_i, Y_i)$, $i = 1, \ldots, n$, are i.i.d. copies of $X = (U, V, Y)$, where $Y \in \mathbb{R}$ is the response variable and $T = (U, V) \in [0,1]^p \times [0,1]^d$ is the covariate variable. Consider a general class of semiparametric regression models with the following partially linear structure:

$$m_0(t) = E_0(Y \mid T = t) = F(g_0(t)), \quad g_0(t) = \theta_0^T u + \eta_0(v), \quad t = (u, v),$$

where $F : \mathbb{R} \to \mathbb{R}$ is some known link function and $\eta_0$ is some unknown smooth function. Here the notation $E_0$ means the expectation under the true data generating probability measure $P_0 = P_{\theta_0,\eta_0}$. The above class of semiparametric models is called generalized partially linear models (GPLM) (Boente et al., 2006). Interestingly, we can derive an explicit expression of the least favorable curve (see Lemma A.1) when the log-likelihood is written in the following form (see footnote 1): $\log p(y, m) = \int_y^m (y - s)/V(s)\, ds$, where $V(m_0(T)) = \mathrm{Var}(Y \mid T)$. We next apply Lemma A.1 to three concrete models.
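As a quick numerical check of the quasi-likelihood form $\log p(y, m) = \int_y^m (y - s)/V(s)\, ds$, the sketch below integrates it with a midpoint rule and compares the result with closed forms for the three variance functions used in the examples that follow ($V(s) = 1$, $V(s) = s^2$ and $V(s) = s(1-s)$). The helper name `quasi_loglik` and the evaluation points are illustrative assumptions.

```python
import math


def quasi_loglik(y, m, V, steps=20000):
    """Midpoint-rule approximation of the integral of (y - s) / V(s) over s from y to m."""
    h = (m - y) / steps  # signed step, so m < y is handled as well
    total = 0.0
    for k in range(steps):
        s = y + (k + 0.5) * h
        total += (y - s) / V(s)
    return total * h


# V(s) = 1: Gaussian case, closed form -(y - m)^2 / 2.
gauss = quasi_loglik(1.5, 0.2, lambda s: 1.0)
# V(s) = s^2: exponential case, closed form -y/m - log(m) + 1 + log(y).
expo = quasi_loglik(2.0, 0.5, lambda s: s * s)
# V(s) = s(1 - s): Bernoulli case with y = 1, closed form log(m).
logit = quasi_loglik(1.0, 0.3, lambda s: s * (1.0 - s))

print(gauss, expo, logit)
```

In each case the quasi-likelihood agrees with the corresponding log-density up to a term that does not depend on $m$.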


Example 2.1 (Partially linear models). Consider a partially linear regression model

$$Y = U^T\theta_0 + \eta_0(V) + w, \tag{2.3}$$

where $w \sim N(0, 1)$ is assumed to be independent of $(U, V)$ and $\eta_0$ belongs to a Hölder function class $C^{\alpha}([0,1]^d)$ with smoothness index $\alpha$. In this case, $F(t) = t$ and $V(s) = 1$. Based on Lemma A.1, we obtain the least favorable curve as

$$\eta^*(\theta)(v) = \eta_0(v) - (\theta - \theta_0)^T E[U \mid V = v]. \tag{2.4}$$

For identifiability, we assume that $E(U - E[U \mid V])^{\otimes 2}$ is invertible.
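The least favorable curve (2.4) and the identifiability condition can be illustrated on a toy design with scalar $\theta$. Everything specific below is a hypothetical choice made only for illustration: $V \sim \mathrm{Uniform}(0,1)$, $U = \sin(2\pi V) + 0.5\,\xi$ with $\xi \sim N(0,1)$, $\eta_0(v) = v^2$ and $\theta_0 = 1$. In this design $E(U - E[U \mid V])^2 = 0.25 > 0$, so the condition holds.

```python
import math
import random


def least_favorable_curve(theta, theta0, eta0, cond_mean_u):
    """Return the map v -> eta0(v) - (theta - theta0) * E[U | V = v] for scalar theta."""
    return lambda v: eta0(v) - (theta - theta0) * cond_mean_u(v)


def cond_mean_u(v):
    """Hypothetical conditional mean E[U | V = v]."""
    return math.sin(2 * math.pi * v)


def eta0(v):
    """Hypothetical true nuisance function."""
    return v ** 2


theta0 = 1.0
rng = random.Random(1)

# Monte Carlo estimate of E(U - E[U | V])^2 under the hypothetical design.
residual_sq = []
for _ in range(50000):
    v = rng.random()
    u = cond_mean_u(v) + 0.5 * rng.gauss(0.0, 1.0)
    residual_sq.append((u - cond_mean_u(v)) ** 2)
info_hat = sum(residual_sq) / len(residual_sq)

curve = least_favorable_curve(1.2, theta0, eta0, cond_mean_u)
print(info_hat, curve(0.25))
```

At $\theta = \theta_0$ the curve reduces to $\eta_0$, which matches the requirement $\eta^*(\theta_0) = \eta_0$.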

Example 2.2 (Partially linear exponential models). In the partially linear exponential model, the conditional density of $Y$ given $(U, V)$ is

$$p_0(y \mid u, v) = \lambda_0(u, v) \exp(-\lambda_0(u, v) y), \quad y > 0, \tag{2.5}$$

with $\lambda_0(u, v) = \exp\{-(u^T\theta_0 + \eta_0(v))\}$. In this case, $F(t) = e^t$ and $V(s) = s^2$. Therefore, by Lemma A.1, we have

$$\eta^*(\theta)(v) = \eta_0(v) - (\theta - \theta_0)^T E[U \mid V = v] + O(|\theta - \theta_0|^2). \tag{2.6}$$

Example 2.3 (Partially linear logistic models). In the partially linear logistic model, we observe binary $Y_i \in \{0, 1\}$ and model the data as

$$\log\left( \frac{P_0(Y = 1 \mid U, V)}{P_0(Y = 0 \mid U, V)} \right) = U^T\theta_0 + \eta_0(V). \tag{2.7}$$

In this case, $F(t) = e^t/(1 + e^t)$ and $V(s) = s(1 - s)$. Again, Lemma A.1 implies the least favorable curve as

$$\eta^*(\theta)(v) = \eta_0(v) - (\theta - \theta_0)^T \frac{E[U f_0(U, V) \mid V = v]}{E[f_0(U, V) \mid V = v]} + O(|\theta - \theta_0|^2), \tag{2.8}$$

where $f_0(u, v) = \exp(u^T\theta_0 + \eta_0(v)) / (1 + \exp(u^T\theta_0 + \eta_0(v)))^2$.


1 This form is also called quasi-likelihood in Wedderburn (1974).

2.3 Cox Proportional Hazards Model

Let $Z \in \mathbb{R}^p$ be a vector of covariates, $T$ the survival time that follows a Cox model, and $C$ a random observation time. The Cox model assumes that the conditional hazard function given $Z$ satisfies $P(T \in [t, t + dt] \mid T \ge t, Z) = \exp(\theta_0^T Z) \lambda_0(t)\, dt$, where $\lambda_0$ is an unknown baseline hazard function. Assume that $T$ is independent of $C$ given $Z$ and that there exists a real $\tau > 0$ such that $P_0(T > \tau) > 0$ and $P_0(C \ge \tau) = P_0(C = \tau) > 0$. In the Cox model with current status data, the observed data are $n$ i.i.d. realizations of $X = (C, \delta, Z)$, where $\delta = I(T \le C)$. The density of $X$ relative to the product of the joint distribution of $(C, Z)$ and counting measure on $\{0, 1\}$ is given by

$$p_{\theta,\Lambda}(x) = \left( 1 - \exp(-e^{\theta^T z} \Lambda(c)) \right)^{\delta} \left( \exp(-e^{\theta^T z} \Lambda(c)) \right)^{1-\delta},$$

where $\Lambda(c) = \int_0^c \lambda(t)\, dt$ is considered as the nuisance parameter. By the derivations in Section 25.11.1 of van der Vaart (1998), the least favorable curve is given by

$$\Lambda^*(\theta)(c) = \Lambda_0(c) - (\theta - \theta_0)^T \Lambda_0(c) \frac{E[Z Q^2_{\theta_0,\Lambda_0}(X) \mid C = c]}{E[Q^2_{\theta_0,\Lambda_0}(X) \mid C = c]} + O(|\theta - \theta_0|^2), \tag{2.9}$$

for the function $Q_{\theta,\Lambda}$ given by

$$Q_{\theta,\Lambda}(x) = \exp(\theta^T z) \left[ \delta\, \frac{\exp(-e^{\theta^T z} \Lambda(c))}{1 - \exp(-e^{\theta^T z} \Lambda(c))} - (1 - \delta) \right].$$


3. Main Results 

3.1 Second Order Semiparametric BvM Theorem 

For a general class of semiparametric models $\mathcal{P} = \{P_{\theta,\eta} : \theta \in \Theta, \eta \in \mathcal{H}\}$, we consider a joint prior distribution $\Pi$ over the product space $\Theta \times \mathcal{H}$ for the parameter pair $(\theta, \eta)$. In the sequel, we use the notation $\Pi^{\theta}_{\mathcal{H}}(\eta)$ and $\Pi_{\Theta}(\theta)$ to denote the conditional prior distribution of $\eta$ given $\theta$ and the marginal prior distribution of $\theta$, respectively.

Our main theorem is based on two primary assumptions, which we will revisit in Section 3.3. The first one is a convergence condition for $(\theta, \eta)$. It allows us to focus on the posterior mass in a suitable neighborhood of $(\theta_0, \eta_0)$: $\{(\theta, \eta) : |\theta - \theta_0| \le \epsilon_n, \eta \in \mathcal{H}_n\}$, where $|\cdot|$ denotes the Euclidean norm. Here, $\mathcal{H}_n$ is a sequence of subsets of the nuisance space $\mathcal{H}$ that satisfies $\Pi(\eta \in \mathcal{H}_n \mid X_1, \ldots, X_n) \xrightarrow{P_0} 1$. For example, $\mathcal{H}_n$ can be defined as $\{\eta : \|\eta - \eta_0\|_n \le M\rho_n\} \cap \mathcal{F}_n^{\mathcal{H}}$, where $\mathcal{F}_n^{\mathcal{H}}$ is a sieve sequence for the nuisance parameter defined after Lemma 3.4 and $\|f\|_n = (n^{-1} \sum_{i=1}^n f^2(X_i))^{1/2}$ is an empirical $L_2$-norm. Recall that $\rho_n$ denotes the contraction rate of the marginal posterior distribution of $\eta$.

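Assumption 1 (Localization condition). There exists a sequence $\epsilon_n \to 0$ satisfying $n\epsilon_n^2 \to \infty$ and a sequence of subsets $\{\mathcal{H}_n\} \subset \mathcal{H}$, such that as $n \to \infty$,

$$\Pi(|\theta - \theta_0| \le \epsilon_n, \eta \in \mathcal{H}_n \mid X_1, \ldots, X_n) = 1 - O_{P_0}(\delta_n)$$

for some $\delta_n \to 0$.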

In a general setup, Lemma 3.4 in Section 3.3 provides a set of sufficient conditions for Assumption 1 with $\epsilon_n = \rho_n$. Throughout the paper, we always choose $\epsilon_n$ to be $\rho_n$.

The second assumption extends the concept of local asymptotic normality (LAN) (required for the parametric BvM theorem in LeCam (1953)) to the semiparametric context. Denote the log-likelihood by $l_n(\theta, \eta)$. By Fubini's theorem, the marginal posterior for $\theta$ can be written as

$$\Pi(\theta \in A \mid X_1, \ldots, X_n) = \frac{\int_A \left\{ \int_{\mathcal{H}} \exp(l_n(\theta, \eta) - l_n(\theta_0, \eta_0))\, d\Pi^{\theta}_{\mathcal{H}}(\eta) \right\} d\Pi_{\Theta}(\theta)}{\int_{\Theta} \left\{ \int_{\mathcal{H}} \exp(l_n(\theta, \eta) - l_n(\theta_0, \eta_0))\, d\Pi^{\theta}_{\mathcal{H}}(\eta) \right\} d\Pi_{\Theta}(\theta)}. \tag{3.1}$$

Therefore, the integrated likelihood ratio $S_n(\theta)$ defined by the map

$$S_n(\theta) = \int_{\mathcal{H}} \exp(l_n(\theta, \eta) - l_n(\theta_0, \eta_0))\, d\Pi^{\theta}_{\mathcal{H}}(\eta) \tag{3.2}$$

plays a role similar to that of the likelihood ratio in the parametric model. To prove the first order semiparametric BvM theorem, Bickel and Kleijn (2012) assume that for every random sequence $\{h_n\}$ of order $O_{P_0}(1)$,

$$\log \frac{S_n(\theta_0 + n^{-1/2} h_n)}{S_n(\theta_0)} = h_n^T g_n - \frac{1}{2} h_n^T \tilde{I}_0 h_n + o_{P_0}(1), \tag{3.3}$$

where $g_n = (1/\sqrt{n}) \sum_{i=1}^n \tilde{\ell}_0(X_i) \rightsquigarrow N(0, \tilde{I}_0)$.

In addition to (3.3), Bickel and Kleijn (2012) further require the marginal posterior of $\theta$ to converge at root-n rate. In many cases, it may require significant effort to verify this parametric-rate condition. To avoid such a stringent assumption as well as to keep track of the higher-order remainder, we introduce the notion of the localized integrated likelihood ratio as follows:


$$\tilde{S}_n(\theta) = \int_{\mathcal{H}_n} \exp(l_n(\theta, \eta) - l_n(\theta_0, \eta_0))\, d\Pi^{\theta}_{\mathcal{H}}(\eta). \tag{3.4}$$

The information in the localization sequence $\mathcal{H}_n$, e.g., $\|\eta - \eta_0\|_n \le M\rho_n$ and $\eta \in \mathcal{F}_n^{\mathcal{H}}$, will be utilized in the application of the maximal inequality (van der Vaart and Wellner, 1996, Corollary 2.2.5) to provide a uniform bound. More importantly, when these conditions are combined with Assumption 1, we no longer need to assume the root-n marginal convergence rate for $\theta$.


Assumption 2 (Second order integrated LAN). There exists a nondecreasing function $R_n(\cdot) : \mathbb{R} \to \mathbb{R}$ satisfying $\sup_{t \ge M n^{-1/2}} R_n(t)/(nt^2) \to 0$ for each $M > 0$ such that for every sequence $\theta_n$ satisfying $\theta_n = \theta_0 + o_{P_0}(1)$,

$$\log \frac{\tilde{S}_n(\theta_n)}{\tilde{S}_n(\theta_0)} = \sqrt{n} (\theta_n - \theta_0)^T g_n - \frac{n}{2} (\theta_n - \theta_0)^T \tilde{I}_0 (\theta_n - \theta_0) + O_{P_0}(R_n(|\theta_n - \theta_0|)). \tag{3.5}$$

Note that (3.5) can be written in the form of (3.3) by re-parameterizing $\theta_n$ as $\theta_0 + n^{-1/2} h_n$. In the sequel, we name (3.5) as ILAN. A typical $R_n(t)$ is dominated by $\sqrt{n} t^2 + \sqrt{n} \rho_n^2$; see the examples in Section 5 and their proofs.
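The re-parameterization can be spelled out term by term. The sketch below assumes that the leading terms of the ILAN expansion are $\sqrt{n}(\theta_n - \theta_0)^T g_n - \tfrac{n}{2}(\theta_n - \theta_0)^T \tilde{I}_0 (\theta_n - \theta_0)$, which is the form consistent with condition (A1) below, and it uses the typical remainder $R_n(t) = \sqrt{n}t^2 + \sqrt{n}\rho_n^2$ only as an illustration. Substituting $\theta_n - \theta_0 = n^{-1/2} h_n$ gives:

```latex
\begin{align*}
\sqrt{n}\,(\theta_n - \theta_0)^T g_n
  &= \sqrt{n}\, n^{-1/2}\, h_n^T g_n = h_n^T g_n, \\
\frac{n}{2}\,(\theta_n - \theta_0)^T \tilde{I}_0\, (\theta_n - \theta_0)
  &= \frac{n}{2}\, n^{-1}\, h_n^T \tilde{I}_0 h_n = \frac{1}{2}\, h_n^T \tilde{I}_0 h_n, \\
R_n\big(n^{-1/2}|h_n|\big)
  &= n^{-1/2} |h_n|^2 + \sqrt{n}\,\rho_n^2 .
\end{align*}
```

For $h_n = O_{P_0}(1)$, the last line is $o_{P_0}(1)$ whenever $\sqrt{n}\rho_n^2 \to 0$, so the first order expansion with an $o_{P_0}(1)$ remainder is recovered, while the second order version keeps the explicit rate.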

Now, we are ready to present the main theorem in this paper. 


Theorem 3.1. We assume the prior for $\theta$ has a Lebesgue density that is continuous and strictly positive at $\theta_0$ and the efficient information matrix $\tilde{I}_0$ is invertible. Suppose $X_1, \ldots, X_n$ are i.i.d. observations sampled from $P_0$. Under Assumptions 1 and 2, we have

$$\sup_A \left| \Pi(\theta \in A \mid X_1, \ldots, X_n) - N_p\big(\theta_0 + n^{-1/2}\Delta_n, (n\tilde{I}_0)^{-1}\big)(A) \right| = O_{P_0}(S_n), \tag{3.6}$$

where $A$ ranges over all measurable subsets of $\Theta$ and $S_n = R_n(n^{-1/2} \log n) + \delta_n$.

We remark that Theorem 3.1 can be easily adapted to a non-asymptotic version (by invoking (A.14) in Lemma A.2) if Assumptions 1 and 2 are stated in a non-asymptotic manner.

The $\log n$ term in $S_n$ is not essential and does not affect the polynomial order of $S_n$, which is our main interest. We need it in the proof so that the posterior probability of the event $\{\theta : M n^{-1/2} \log n \le |\theta - \theta_0| \le \epsilon_n\}$ decays at a faster rate than $S_n$ for a sufficiently large $M$.
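As a worked illustration of the rate $S_n = R_n(n^{-1/2}\log n) + \delta_n$, the display below plugs in the typical remainder $R_n(t) = \sqrt{n}t^2 + \sqrt{n}\rho_n^2$. The choices $\delta_n = \exp(-C_1 n\rho_n^2)$ and $\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{\gamma}$ are assumptions taken from the partially linear setting discussed elsewhere in the paper, not general facts.

```latex
\begin{align*}
S_n &= \sqrt{n}\,\big(n^{-1/2}\log n\big)^2 + \sqrt{n}\,\rho_n^2 + \delta_n
     = n^{-1/2}\log^2 n + \sqrt{n}\,\rho_n^2 + \exp(-C_1 n \rho_n^2), \\
\sqrt{n}\,\rho_n^2 &= n^{-1/2}\,\big(n\rho_n^2\big)
     = n^{-1/2}\, n^{d/(2\alpha+d)} (\log n)^{2\gamma} .
\end{align*}
```

Because $n\rho_n^2$ grows polynomially in $n$, the term $\sqrt{n}\rho_n^2$ dominates both $n^{-1/2}\log^2 n$ and the exponentially small $\delta_n$, so under these assumptions $S_n$ is of the order $\sqrt{n}\rho_n^2$, which tends to zero when $\alpha > d/2$.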

We comment that Assumption 2 is implied by the following conditions: (A1) on the semiparametric model; and (A2) on the prior. Specifically, Lemma 3.5 in Section 3.3 shows that $R_n(\cdot) = G_n(\cdot) + \tilde{G}_n(\cdot)$, where $G_n$ and $\tilde{G}_n$ are given in (A1) and (A2), respectively. Note that $\mathcal{H}_n$ in Conditions (A1) and (A2) is the same as that in Assumption 1.

(A1) (Stochastic LAN) There exists an increasing function $G_n : \mathbb{R} \to [0, \infty)$, such that for every sequence $\{\theta_n\}$ satisfying $\theta_n = \theta_0 + o_{P_0}(1)$,

$$\sup_{\eta \in \mathcal{H}_n} \left| l_n(\theta_n, \eta + \Delta\eta(\theta_n)) - l_n(\theta_0, \eta) - (\theta_n - \theta_0)^T \sum_{i=1}^n \tilde{\ell}_0(X_i) + \frac{n}{2} (\theta_n - \theta_0)^T \tilde{I}_0 (\theta_n - \theta_0) \right| = O_{P_0}(G_n(|\theta_n - \theta_0|)). \tag{3.7}$$

If we set $\eta = \eta_0$ in (3.7), then we obtain the LAN for the least favorable submodel $l_n(\theta_n, \eta^*(\theta_n))$. A typical $G_n(t)$ in (3.7) is dominated by $\sqrt{n} t^2 + \sqrt{n} \rho_n^2$. For example, see the verification of (A1) in the proof of Theorem 5.1.

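Condition (A2) characterizes the prior stability under a small perturbation in the likelihood function caused by $\Delta\eta(\theta_n)$ in the nuisance part.

(A2) (Prior stability under perturbation) There exists an increasing function $\tilde{G}_n : \mathbb{R} \to [0, \infty)$, such that for any $\theta_n = \theta_0 + o_{P_0}(1)$,

$$\frac{\int_{\mathcal{H}_n} \exp(l_n(\theta_0, \eta - \Delta\eta(\theta_n)))\, d\Pi^{\theta_n}_{\mathcal{H}}(\eta)}{\int_{\mathcal{H}_n} \exp(l_n(\theta_0, \eta))\, d\Pi^{\theta_0}_{\mathcal{H}}(\eta)} = 1 + O_{P_0}(\tilde{G}_n(|\theta_n - \theta_0|)).$$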
Condition (A2) is crucial for proving the root-n convergence rate of $\theta$, whose failure is typically caused by $\liminf_{n \to \infty} \tilde{G}_n(n^{-1/2} \log n) > 0$ (see the numerical study in Section 5.1.3). In fact, we call a nonparametric prior an unbiased one if $\lim_{n \to \infty} \tilde{G}_n(n^{-1/2} \log n) = 0$ since it corrects the semiparametric bias $\Delta\eta$ in (A2). In the special case (Bickel, 1982) that $\{P_{\theta,\eta_0} : \theta \in \Theta\}$ forms a least favorable submodel, i.e., $\Delta\eta = 0$, (A2) automatically holds when independent priors are assigned for $\theta$ and $\eta$. However, in the general case where $\Delta\eta \ne 0$, we typically have $\Delta\eta(\theta_n) = O(|\theta_n - \theta_0|)$ (see (A3) in Section 4) and that

$$\exp\{l_n(\theta_0, \eta - \Delta\eta(\theta_n)) - l_n(\theta_0, \eta)\} = O_{P_0}(n |\theta_n - \theta_0| \rho_n)$$

does not converge to zero. Therefore, under independent priors, (A2) cannot be implied by bounding the ratio between the integrands in its denominator and numerator unless we are willing to impose additional conditions such as (A5) and (A6) in Section 4.


3.2 Second Order Bayesian Inference 

In practice, we can employ an MCMC algorithm to efficiently draw a sequence of samples $\{\theta^{(l)} : l = 1, \ldots, L\}$ from the marginal posterior distribution of $\theta = (\theta_1, \ldots, \theta_p)$, based on which Bayesian estimators and credible regions can be constructed. Their frequentist validity together with second order properties can be rigorously justified by our Theorem 3.1. For example, Theorem 3.1 directly implies the semiparametric efficiency of the posterior median as follows.
















Corollary 3.2. Consider the semiparametric model and the prior $\Pi$ in Theorem 3.1. Under the same assumptions, the coordinate-wise marginal posterior median $\hat{\theta}_n$ satisfies

$$\sqrt{n} (\hat{\theta}_n - \theta_0) = \Delta_n + O_{P_0}(S_n),$$

where $\Delta_n \rightsquigarrow N_p(0, \tilde{I}_0^{-1})$ and $S_n = R_n(n^{-1/2} \log n) + \delta_n$.

The conclusion in Corollary 3.2 may also hold for the posterior mode, but this would require the convergence of the posterior density instead of the posterior distribution as in Theorem 3.1.

We next study the frequentist property of credible regions. For any $\alpha \in (0, 1)$, we define the $\alpha$-th marginal posterior quantile $\hat{q}_{s,\alpha}$ of $\theta_s$ through the equation $\Pi(\theta_s \le \hat{q}_{s,\alpha} \mid X_1, \ldots, X_n) = \alpha$. Let $(-\infty, q_{s,\alpha}]$ be a one-sided confidence interval for $\theta_s$ of significance level $\alpha$ based on the $s$-th component of the best regular estimator, which is well approximated by $\theta_0 + \Delta_n/\sqrt{n}$. In other words, $q_{s,\alpha}$ is given by $\theta_{0,s} + \Delta_{n,s}/\sqrt{n} + n^{-1/2} (\tilde{I}_0^{ss})^{1/2} z_{\alpha}$ so that $P_0(\theta_{0,s} \le q_{s,\alpha}) \to \alpha$ as $n \to \infty$. Here $\tilde{I}_0^{ss}$ is the $(s, s)$-th component of $\tilde{I}_0^{-1}$, and $\theta_{0,s}$ and $\Delta_{n,s}$ are the $s$-th components of $\theta_0$ and $\Delta_n$, respectively. The following corollary suggests that the credible interval $(-\infty, \hat{q}_{s,1-\alpha}]$ ($[\hat{q}_{s,\alpha/2}, \hat{q}_{s,1-\alpha/2}]$) estimates this one-(two-)sided confidence interval for $\theta_s$ of significance level $(1 - \alpha)$ with an error of order $S_n$.

Corollary 3.3. Consider the semiparametric model and the prior $\Pi$ in Theorem 3.1. Under the same assumptions, we have $\sqrt{n} |\hat{q}_{s,\alpha} - q_{s,\alpha}| = O_{P_0}(S_n)$ for $s = 1, \ldots, p$.

Remark 3.1. The MCMC samples can also be used to construct an estimator of the asymptotic variance $V^*$ (or the efficient information matrix $\tilde{I}_0$), denoted as $\hat{V}^*$. As shown below, we have $\|\hat{V}^* - V^*\|_F = O_{P_0}(S_n)$ and $\|(\hat{V}^*)^{-1} - \tilde{I}_0\|_F = O_{P_0}(S_n)$, where $\|\cdot\|_F$ is the Frobenius norm. The diagonal element $V^*_{ss}$ can be estimated by $\hat{V}^*_{ss} = \{\sqrt{n} (\hat{q}_{s,1-\alpha/2} - \hat{q}_{s,\alpha/2}) / (2 z_{1-\alpha/2})\}^2$, where $z_{\alpha}$ is the $\alpha$-th quantile of a standard normal distribution. According to the proof of Corollary 3.3, we have $\hat{q}_{s,\alpha} = \theta_{0,s} + n^{-1/2} \Delta_{n,s} + n^{-1/2} (V^*_{ss})^{1/2} z_{\alpha} + n^{-1/2} O_{P_0}(S_n)$, which implies $\hat{V}^*_{ss} - V^*_{ss} = O_{P_0}(S_n)$. For the off-diagonal element $V^*_{ss'}$ ($s \ne s'$), we can first obtain the $\alpha$-th quantile $\hat{q}_{s,s',\alpha}$ for the marginal posterior distribution of $\vartheta = \theta_s + \theta_{s'}$ and then set $\hat{V}^*_{ss'} = \frac{1}{2} \{ [\sqrt{n} (\hat{q}_{s,s',1-\alpha/2} - \hat{q}_{s,s',\alpha/2}) / (2 z_{1-\alpha/2})]^2 - \hat{V}^*_{ss} - \hat{V}^*_{s's'} \}$. Since equation (3.6) implies that

$$\sup_A \left| \Pi(\vartheta \in A \mid X_1, \ldots, X_n) - N\big(\theta_{0,s} + \theta_{0,s'} + n^{-1/2} \Delta_{n,s} + n^{-1/2} \Delta_{n,s'}, n^{-1} \Sigma\big)(A) \right| = O_{P_0}(S_n),$$

where $\Sigma = V^*_{ss} + V^*_{s's'} + 2 V^*_{ss'}$, we obtain $\hat{V}^*_{ss'} = V^*_{ss'} + O_{P_0}(S_n)$. This proves our previous claim.
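The quantile-based variance estimator of Remark 3.1 can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the "posterior draws" are synthetic Gaussian draws with covariance $V/n$ standing in for MCMC output, and the estimator is written in the squared form implied by the normal approximation, with the off-diagonal entry recovered from $\Sigma = V_{ss} + V_{s's'} + 2V_{ss'}$.

```python
import math
import random
from statistics import NormalDist


def empirical_quantile(sorted_values, prob):
    """Linear-interpolation quantile of an already sorted list."""
    pos = prob * (len(sorted_values) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def scaled_variance(values, n, alpha=0.05):
    """{sqrt(n) * (q_{1-alpha/2} - q_{alpha/2}) / (2 z_{1-alpha/2})}^2 from posterior draws."""
    xs = sorted(values)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    width = empirical_quantile(xs, 1 - alpha / 2) - empirical_quantile(xs, alpha / 2)
    return (math.sqrt(n) * width / (2 * z)) ** 2


def quantile_covariance(draws, n, alpha=0.05):
    """Estimate the p x p asymptotic variance from draws given as a list of p-tuples."""
    p = len(draws[0])
    cols = [[row[s] for row in draws] for s in range(p)]
    V = [[0.0] * p for _ in range(p)]
    for s in range(p):
        V[s][s] = scaled_variance(cols[s], n, alpha)
    for s in range(p):
        for t in range(s + 1, p):
            # Variance of theta_s + theta_t, then solve Sigma = V_ss + V_tt + 2 V_st.
            sigma = scaled_variance([a + b for a, b in zip(cols[s], cols[t])], n, alpha)
            V[s][t] = V[t][s] = 0.5 * (sigma - V[s][s] - V[t][t])
    return V


# Synthetic stand-in for MCMC output: draws ~ N(0, V / n) with V = [[1, 0.5], [0.5, 2]].
rng = random.Random(0)
n = 400
draws = []
for _ in range(100000):
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    draws.append((z1 / math.sqrt(n), (0.5 * z1 + math.sqrt(1.75) * z2) / math.sqrt(n)))

V_hat = quantile_covariance(draws, n)
print(V_hat)
```

With real MCMC output, `draws` would hold the sampled values of $\theta$ and `n` the sample size of the data; the estimate inherits the second order error discussed in the remark in addition to Monte Carlo error.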


3.3 Verification of Assumptions 1 and 2


We verify Assumption 1 in a general class of statistical models $\mathcal{P} = \{P_{\lambda}^{(n)} : \lambda \in \mathcal{F}\}$, where the observations $X^{(n)} = (X_1, \ldots, X_n)$ are independent but not necessarily identically distributed. Hence, we have $P_{\lambda}^{(n)}(X^{(n)}) = \prod_{i=1}^n P_{\lambda,i}(X_i)$ with $P_{\lambda,i}$ the marginal distribution of $X_i$ under a common parameter $\lambda$ (whose true value is denoted as $\lambda_0$). In the above setup, Ghosal and van der Vaart (2007) derived the posterior contraction rate of $\lambda$ as being at least $\xi_n$ (in terms of a semi-metric $d_n^2(\lambda, \lambda') = \frac{1}{n} \sum_{i=1}^n \int (\sqrt{p_{\lambda,i}} - \sqrt{p_{\lambda',i}})^2\, d\mu_i$ for any pair $(\lambda, \lambda')$ in $\mathcal{F}$) by showing $\Pi(d_n(\lambda, \lambda_0) \ge M\xi_n \mid X_1, \ldots, X_n) = o_{P_{\lambda_0}^{(n)}}(1)$. In Lemma 3.4 below, we obtain an exponential convergence rate of $\Pi(d_n(\lambda, \lambda_0) \ge M\xi_n \mid X_1, \ldots, X_n)$ by keeping track of the remainder term in the proof of Theorem 4 therein.

Lemma 3.4 is also of independent interest. Denote $V_2(P, Q) = \int |\log(dP/dQ) - K(P, Q)|^2\, dP$ as a discrepancy measure between two probability measures $P$ and $Q$.







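Lemma 3.4. Let $\xi_n$ be a sequence satisfying $\xi_n \to 0$ and $n\xi_n^2 \to \infty$. If there exists an increasing sequence of sieves $\mathcal{F}_n \subset \mathcal{F}$, such that the following conditions are satisfied:

a. $\Pi(\mathcal{F} \setminus \mathcal{F}_n) \le \exp(-n\xi_n^2 (C + 4))$ for some $C > 0$;

b. $\log N(\xi_n, \mathcal{F}_n, d_n) \le n\xi_n^2$;

c. $\Pi(B_n(P_{\lambda_0}^{(n)}, \xi_n)) \ge \exp(-C n\xi_n^2)$,

where $B_n(P_{\lambda_0}^{(n)}, \xi_n) = \{\lambda \in \mathcal{F} : K(P_{\lambda_0}^{(n)}, P_{\lambda}^{(n)}) \le n\xi_n^2,\ V_2(P_{\lambda_0}^{(n)}, P_{\lambda}^{(n)}) \le n\xi_n^2\}$, then for some constant $C_1 > 0$ and large enough $M$, we have

$$\Pi(d_n(\lambda, \lambda_0) \ge M\xi_n \mid X_1, \ldots, X_n) = O_{P_{\lambda_0}^{(n)}}(\exp(-C_1 n\xi_n^2)). \tag{3.8}$$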


In semiparametric models, the sieve sequence $\mathcal{F}_n$ typically consists of one parametric part and one nonparametric part. For example, $\mathcal{F}_n = \mathcal{F}_n^{\Theta} \oplus \mathcal{F}_n^{\mathcal{H}} = \{\theta^T u + \eta(v) : \theta \in \mathcal{F}_n^{\Theta}, \eta \in \mathcal{F}_n^{\mathcal{H}}\}$ in the class of GPLM. By viewing $(\theta, \eta)$ as $\lambda$ in the above lemma, we can conclude that the posterior probability of the event $\{\|U^T(\theta - \theta_0) + \eta - \eta_0\|_n \le M\xi_n\}$ is $1 - O_{P_{\lambda_0}^{(n)}}(\exp(-C_1 n\xi_n^2))$ if $d_n(\lambda, \lambda_0)$ dominates $\|U^T(\theta - \theta_0) + \eta - \eta_0\|_n$. In partially linear models, we can further show that this event implies $\{|\theta - \theta_0| \le c\xi_n, \|\eta - \eta_0\|_n \le c\xi_n\}$ for some constant $c > 0$ given that the matrix $P_0(U - E[U \mid V])^{\otimes 2}$ is invertible. Please see Lemma A.3 and the arguments after it. In this case, we know that $\rho_n$ and $\epsilon_n$ in Assumption 1 turn out to be $\xi_n$ given in Lemma 3.4 (and $\delta_n = \exp(-C_1 n\xi_n^2)$). As a by-product of Lemma 3.4, we show that $\Pi(\lambda \notin \mathcal{F}_n \mid X_1, \ldots, X_n) = O_{P_{\lambda_0}^{(n)}}(e^{-C_1 n\xi_n^2})$ by following Lemma 1 in Ghosal and van der Vaart (2007). Finally, we remark that Lemma 3.4 does not apply to generalized partially linear models. Rather, we verify Assumption 1 by directly applying Lemma 2 in Ghosal and van der Vaart (2007); see Lemma A.4.

We next discuss the sufficient condition (A1) for Assumption 2. Note that (A1) depends on the prior through the localization sequence $\{\mathcal{H}_n\}$ in Assumption 1, to which the posterior distribution allocates most mass. With a small subset $\mathcal{H}_n$, the L.H.S. of (3.7) converges to zero at a faster rate. Hence, we want to make $\mathcal{H}_n$ as small as possible while keeping $\Pi(\mathcal{H}_n \mid X_1, \ldots, X_n)$ close to one. Motivated by this, we set

$$\mathcal{H}_n = \{\eta : \|\eta - \eta_0\|_n \le M\rho_n\} \cap \mathcal{F}_n^{\mathcal{H}}, \tag{3.9}$$

where $\{\mathcal{F}_n^{\mathcal{H}}\}$ is the sieve sequence constructed in Lemma 3.4. By Assumption 1 and condition (a) in Lemma 3.4, we obtain that $\Pi(\mathcal{H}_n \mid X_1, \ldots, X_n) = 1 - O_{P_0}(\delta_n)$ with $\delta_n = e^{-n\rho_n^2}$. Then we can bound the L.H.S. of (3.7) from above by calculating the continuity modulus or applying the maximal inequalities in van der Vaart and Wellner (1996); see Lemma A.2. Please see Section 5 for the verification of (A1) in concrete examples.

Now we are ready to state our lemma for Assumption 2.

Lemma 3.5. If (A1) and (A2) hold, then Assumption 2 holds with $R_n = G_n + \tilde{G}_n$.


4. Semiparametric Prior 

In this section, we consider two classes of priors, differing in whether $\theta$ and $\eta$ are dependent, and then specify the corresponding form of $\tilde{G}_n(\cdot)$ in the prior stability condition (A2) for them. In general, in applying the semiparametric BvM theorem we find that the dependent prior has advantages in requiring less stringent conditions and being adaptive to the unknown smoothness of the nonparametric function. Throughout this section, we impose a smoothness condition on the least favorable curve:

(A3) There exists a function $h^* \in L_2(P_0)$, referred to as the least favorable direction, such that $\Delta\eta(\theta) = (\theta - \theta_0)^T h^* + O(|\theta - \theta_0|^2)$ as $\theta \to \theta_0$.

Note that (A3) is commonly assumed in the literature. For example, it holds for the class of GPLM under mild conditions; see Lemma A.1.


4.1 Independent Prior 

Consider a pair of independent priors: 

(PI) $\theta \sim \Pi_{\Theta}$, $\eta \sim \Pi_{\mathcal{H}}$.

This is a common choice in the semiparametric Bayesian literature with various forms of $\Pi_{\mathcal{H}}$. For example, Kim (2006) considered a class of neutral-to-the-right process priors for the cumulative hazard function in the Cox proportional hazards model, while Castillo (2012b) considered a class of Gaussian process priors (Rasmussen and Williams, 2006) for the same model. Another example is a Riemann-Liouville type prior considered by Bickel and Kleijn (2012) in the partially linear models.

We next specify the form of $\tilde{G}_n(\cdot)$ under the above independent prior. For technical reasons, we need to introduce a sequence of approximations to the least favorable direction $h^*$, denoted as $\{h_n\}$. Let $\Pi_{\mathcal{H},-g}$ represent the distribution of $W - g$ for $W \sim \Pi_{\mathcal{H}}$ and a function $g$, and define $f_n = d\Pi_{\mathcal{H},-(\theta_n - \theta_0)^T h_n} / d\Pi_{\mathcal{H}}$ as the Radon-Nikodym derivative. For any set $A \subset \mathcal{H}$ and element $f \in \mathcal{H}$, let $A - f$ denote the set $\{g - f : g \in A\}$. For $\epsilon_n$ and $\delta_n$ specified in Assumption 1, we assume that

(A4) There exists a nondecreasing function $\bar{G}_n : \mathbb{R} \to \mathbb{R}$, such that for any $\theta_n = \theta_0 + O_{P_0}(\epsilon_n)$,

$$\sup_{\eta \in \mathcal{H}_n} \left| l_n(\theta_0, \eta - \Delta\eta(\theta_n) + (\theta_n - \theta_0)^T h_n) - l_n(\theta_0, \eta) \right| = O_{P_0}(\bar{G}_n(|\theta_n - \theta_0|)).$$

(A5) For any $\theta_n$ satisfying $|\theta_n - \theta_0| \le \epsilon_n$, we have

$$\Pi_{\mathcal{H}}(\eta \in \mathcal{H}_n - (\theta_n - \theta_0)^T h_n \mid X_1, \ldots, X_n) = 1 - O_{P_0}(\delta_n).$$

(A6) For any $\theta_n = \theta_0 + O_{P_0}(\epsilon_n)$, $|\log f_n(\eta)| = O_{P_0}[\bar{G}_n(|\theta_n - \theta_0|)]$ holds with $\eta \sim \Pi_{\mathcal{H}}$.

(A4) characterizes the robustness of $l_n(\cdot)$ against a small perturbation in $\eta$. In fact, by Condition (A3), we have $\Delta\eta(\theta_n) - (\theta_n - \theta_0)^T h_n = (\theta_n - \theta_0)^T (h^* - h_n) + O(|\theta_n - \theta_0|^2)$. Hence, Condition (A4) is expected to hold if $h_n$ is sufficiently close to $h^*$. Similar to (A4), (A5) characterizes the concentration stability of the localization sequence $\{\mathcal{H}_n\}$ against a small perturbation in $\eta$. This stability can be easily obtained by slightly enlarging the localization sequence via $\mathcal{H}_n \leftarrow \bigcup_{|\theta - \theta_0| \le \epsilon_n} \{\mathcal{H}_n - (\theta - \theta_0)^T h_n\}$. For simplicity, we tacitly assume that this enlargement is always made for $\mathcal{H}_n$. As we will clarify in the proof of Theorem 5.1, this enlargement only increases the covering entropy of $\mathcal{H}_n$ by a negligible amount proportional to $\log(\epsilon_n^{-1})$, which will not affect our results. (A6) characterizes the robustness of the marginal prior $\Pi_{\mathcal{H}}$ against a small perturbation. The reason for introducing the approximation sequence $\{h_n\}$ is that the Radon-Nikodym derivative $|\log f_n(\eta)|$ in (A6) might have peculiar behavior at $h_n = h^*$. As an example, we consider the partially linear model in Section 5 where a Gaussian process (GP) prior is assigned. If we set $h_n$ as $h^*$, then we have to require $h^* \in \mathbb{H}$ (see footnote 2) such that $|\log f_n(\eta)|$ converges to zero. This requirement is very strict since $\mathbb{H}$ is often a very small subset of $\mathcal{H}$. Fortunately, we can always find an approximation sequence $\{h_n\} \subset \mathbb{H}$ under which condition (A6) is satisfied. Note that a condition similar to (A6) is also required for the first order semiparametric BvM theorem; see Castillo (2012b).

2 $\mathbb{H}$ is the reproducing kernel Hilbert space (RKHS) associated with the assigned GP.


To verify the stability condition (A2), we can decompose the semiparametric bias $\Delta\eta(\theta_n)$ into
two components: $\Delta\eta(\theta_n) - (\theta_n-\theta_0)^T h_n$ and $(\theta_n-\theta_0)^T h_n$. The former can be dealt with by (A4)
through the likelihood, and the latter by (A5) and (A6) through the localization sequence and the prior.
This is summarized in the following lemma.


Lemma 4.1. Suppose that Conditions (A4), (A5) and (A6) hold. Then the pair of independent
priors (PI) satisfies (A2) with $\tilde G_n(t) = \bar G_n(t) + \delta_n$.


4.2 Dependent Prior 


In this section, we construct a class of dependent priors $(\Pi_\Theta, \Pi_H^\theta)$. Dependent priors facilitate
the development of adaptive Bayesian procedures that do not require knowledge of the smoothness
of $\eta$ in specifying $\Pi_H^\theta$. This adaptiveness is achieved by correcting the $\theta$-dependent bias
$\Delta\eta(\theta)$ in the prior construction. We remark that adaptiveness cannot be achieved by the
independent priors (see Section A.1), and this finding is consistent with the negative observations in
Rivoirard and Rousseau (2012) for linear functionals of densities.

Let $\hat h_n$ be an estimator of the least favorable direction $h^*$ that satisfies (A4) - (A5) with $h_n = \hat h_n$.
Again, by (A3) we have $\Delta\eta(\theta_n) - (\theta_n-\theta_0)^T\hat h_n = (\theta_n-\theta_0)^T(h^*-\hat h_n) + O(|\theta_n-\theta_0|^2)$. Consequently,
(A4) is implied by the following condition with $\bar G_n(t) = n\rho_n\kappa_n t + n\rho_n t^2$:


(A7) The estimator $\hat h_n$ of $h^*$ satisfies $\|\hat h_n - h^*\|_n = O_{P_0}(\kappa_n)$, $\kappa_n \to 0$.

Please see concrete examples in Section 5 for more discussion on Condition (A7). Let $\Pi_\Theta$ be a marginal prior for $\theta$
that satisfies the condition in Theorem 3.1 and $\Pi_H$ a prior for $\eta$. Consider the following joint prior
distribution for $(\theta,\eta)$,

(PD)    $\theta \sim \Pi_\Theta$,  $\eta \mid \theta \sim W + \theta^T\hat h_n$ with $W \sim \Pi_H$.


The conditional prior distribution $\Pi_H^\theta$ of $\eta$ given $\theta$ is obtained by shifting the center of $\Pi_H$ by
a $\theta$-dependent amount, i.e., $\theta^T\hat h_n$. By introducing this dependent structure, we can compensate
for the semiparametric bias without imposing Condition (A6). Finally, we remark that the
randomness of $\hat h_n$ only enters equation (3.5) in Assumption 2 through the remainder term, and
thus can be decoupled from the randomness in the leading terms of equation (3.5). Hence, the
proof of Theorem 3.1 still goes through even though (PD) is data-dependent. This is an appealing
feature of the proposed prior: our theory shows that we do not need to split the sample and apply
a two-stage approach to obtain a valid characterization of uncertainty. This is backed up by our
simulations.


Lemma 4.2. If conditions (A4) - (A5) are met with $h_n = \hat h_n$, then the dependent prior (PD)
satisfies (A2) with $\tilde G_n = \bar G_n + \delta_n$.


4.3 Second-order BvM Theorem under Independent/Dependent Prior 

We summarize the discussions on independent prior (PI) and dependent prior (PD) in the following
theorem, which is a straightforward application of Theorem 3.1.

Theorem 4.3. Suppose $X_1,\ldots,X_n$ are i.i.d. observations sampled from $P_0 = P_{\theta_0,\eta_0}$. Suppose
that Assumption 1 and Conditions (A1) and (A3) hold and the prior for $\theta$ is dense at $\theta_0$. We further
assume Conditions (A4) - (A6) for the independent prior (PI) and Conditions (A4) - (A5) for
the dependent prior (PD). Then the marginal posterior for $\theta$ has the following expansion in total
variation as $n \to \infty$,

$$\sup_A\big|\Pi(\theta\in A\mid X_1,\ldots,X_n) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}\big[G_n(n^{-1/2}\log n) + \bar G_n(n^{-1/2}\log n) + \delta_n\big].$$

5. Examples 

In this section, we construct specific priors for three semiparametric models: the partially linear model
(PLM), the generalized partially linear model (GPLM) and the Cox regression model. In PLM, we
consider two scenarios: (i) the smoothness of the nonparametric part $\eta$ is known; (ii) the smoothness
is unknown and an adaptive marginal prior is assigned to $\eta$. The non-adaptive and adaptive results
obtained in PLM can be easily generalized to GPLM. We assign GP priors for the first two models and
a Riemann-Liouville type prior for the last model.


5.1 Partially Linear Model 

5.1.1 Non-adaptive Bayesian Procedure 


We start with a pair of independent priors. In principle, the marginal prior for the parametric part
$\theta$ can be any continuous distribution with full support over $\Theta$. For computational convenience such
as conjugacy, we specify $\Pi_\Theta$ as a multivariate normal distribution $N(0, I_p/\phi_0)$ with $\phi_0$ the precision
parameter. For example, one can choose $\phi_0 = 0.01$ to induce a vague prior for normalized predictors.
For the nuisance part, we choose $\Pi_H$ as a stationary Gaussian process (GP) prior $GP(m, K^a)$
indexed by an inverse bandwidth parameter $a$ (van der Vaart and van Zanten, 2009). Here, the
notation $GP(m, K)$ denotes a Gaussian process with mean function $m : \mathbb R^d \to \mathbb R$ and covariance
function $K : \mathbb R^d \times \mathbb R^d \to \mathbb R$. The scaled covariance function $K^a$ is defined as $K^a(x,y) = K_0(ax, ay)$,
where $K_0$ is a base covariance function (see footnote 3). Throughout this section, we focus on the squared exponential
covariance function $K_0(x,y) = \exp(-|x-y|^2)$. We next discuss the choice for the inverse bandwidth
parameter $a$ given the knowledge of the smoothness of the nuisance function, denoted as $\alpha$. Given $n$
independent observations, the minimax rate of estimating a $d$-variate $\alpha$-smooth function is known to
be $n^{-\alpha/(2\alpha+d)}$ (Stone, 1982). van der Vaart and van Zanten (2009) showed that with $a_n = n^{1/(2\alpha+d)}$
the Gaussian process prior $GP(0, K^{a_n})$ leads to the minimax rate up to a $\log n$ factor. Hence, we
set $a_n = n^{1/(2\alpha+d)}$ in this subsection.
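The rescaled prior described above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: the function names, the jitter term added for numerical stability, and the evaluation grid are all assumptions made for this example.

```python
import numpy as np

def inverse_bandwidth(n, alpha, d):
    """Rescaling a_n = n^{1/(2*alpha + d)} for an alpha-smooth, d-variate target."""
    return n ** (1.0 / (2.0 * alpha + d))

def sq_exp_kernel(x, y, a):
    """Scaled squared exponential kernel K^a(x, y) = K_0(a x, a y) = exp(-a^2 |x - y|^2).

    x has shape (m, d) and y has shape (k, d); the result is an (m, k) matrix.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-(a ** 2) * sq_dist)

def sample_gp_prior(points, a, n_draws=1, jitter=1e-6, seed=0):
    """Draw mean-zero GP paths at `points`; `jitter` is a hypothetical stabilizer."""
    rng = np.random.default_rng(seed)
    gram = sq_exp_kernel(points, points, a) + jitter * np.eye(len(points))
    chol = np.linalg.cholesky(gram)
    return (chol @ rng.standard_normal((len(points), n_draws))).T

n, alpha, d = 200, 2.0, 1
a_n = inverse_bandwidth(n, alpha, d)           # equals 200 ** 0.2 here
grid = np.linspace(0.0, 1.0, 50).reshape(-1, 1)
draws = sample_gp_prior(grid, a_n, n_draws=3)  # three prior paths on the grid
```

A larger $a_n$ shortens the effective length scale, so prior paths become rougher as $n$ grows or as the assumed smoothness $\alpha$ decreases.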

We next focus on the dependent prior (PD). The least favorable direction $h^*(\cdot)$ in this model
is essentially $-E[U \mid V = \cdot]$, which can be directly estimated based on the design points $\{(U_i, V_i)\}$,
e.g., by a kernel method. Denote this estimator as $\hat h_n$. Since shifting the center of a GP is equivalent
to translating its mean function, we can write the dependent prior (PD) as

$$\theta \sim \Pi_\Theta \quad\text{and}\quad \eta \mid \theta \sim GP(\theta^T\hat h_n,\ K^{a_n}).$$


By writing (PD) in this form, we can discuss its relation with independent priors. If we
reparameterize the nuisance parameter by $\xi = \eta - \theta^T\hat h_n$, then $\xi \mid \theta \sim GP(0, K^{a_n})$ and the partially linear
model becomes

$$Y = \theta^T[U + \hat h_n(V)] + \xi(V) + w, \tag{5.1}$$

with true values $\theta_0$ and $\xi_0 := \eta_0 - \theta_0^T\hat h_n$. If we treat $U + \hat h_n(V)$ as a new covariate $\tilde U$, then the least
favorable direction of the new model becomes

$$\tilde h = E[\tilde U \mid V] = \hat h_n - h^* = O_{P_0}(\kappa_n) \to 0,$$

where $\kappa_n$ is defined in (A7). This suggests that the semiparametric bias $\Delta\xi(\theta)$ of the new model is
negligible.

3 For the covariance function $K^a$, we use $\mathbb H^a$ and $\|\cdot\|_a$ to denote the associated RKHS and RKHS norm, respectively.
The unit ball in the RKHS $\mathbb H^a$ is denoted by $\mathbb H^a_1$.
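The effect of the reparameterization can be checked on synthetic data. The sketch below is illustrative only: the design ($V \sim N(0,1)$, $E[U \mid V] = 0.5V^3$), the oracle shift $-E[U \mid V]$ standing in for the estimator, and the helper `binned_conditional_mean` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
V = rng.standard_normal(n)
U = 0.5 * V ** 3 + rng.standard_normal(n)   # so that E[U | V] = 0.5 V^3

# Oracle shift h_n = -E[U | V]; in practice an estimator would replace it.
h_n = -0.5 * V ** 3
U_tilde = U + h_n                            # new covariate U + h_n(V)

def binned_conditional_mean(x, v, n_bins=10):
    """Crude estimate of E[x | v]: average x within quantile bins of v."""
    edges = np.quantile(v, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.searchsorted(edges, v, side="right") - 1, 0, n_bins - 1)
    return np.array([x[idx == b].mean() for b in range(n_bins)])

before = binned_conditional_mean(U, V)        # varies strongly with V
after = binned_conditional_mean(U_tilde, V)   # close to zero in every bin
```

After the shift, the new covariate is approximately mean-independent of $V$, which is the sense in which the least favorable direction of the reparameterized model is close to zero.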

Theorem 5.1. Let $X_i = (U_i, V_i, Y_i) \in \mathbb R^p \times \mathbb R^d \times \mathbb R$, $i = 1,\ldots,n$, be $n$ observations from the
partially linear model (2.3). Assume the following conditions:

(i). $\eta_0$ is Hölder $\alpha$-smooth, where $\alpha > d/2$.

(ii). The information matrix $I_0 = P_0(U - E[U\mid V])(U - E[U\mid V])^T$ is invertible.

(iii). For the independent prior, the conditional expectation $E[U \mid V = v]$ as a function of $v$ is
at least $\alpha$-smooth; for the dependent prior, (A7) holds with $\kappa_n = O_{P_0}(\rho_n)$, where
$\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{d+1}$.

Then with the choice of $a_n = n^{1/(2\alpha+d)}$, we have the following second order BvM result:

$$\sup_A\big|\Pi(\theta\in A\mid X^{(n)}) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}(\sqrt n\,\rho_n^2\log n) = O_{P_0}\big(n^{-\frac{2\alpha-d}{2(2\alpha+d)}}(\log n)^{2d+3}\big), \tag{5.2}$$

where $\tilde\Delta_n = n^{-1/2}\sum_{i=1}^n I_0^{-1} w_i\,(U_i - E[U\mid V_i]) \rightsquigarrow N_p(0, I_0^{-1})$.

If the smoothness of the GP does not match the smoothness of the regression function,
i.e., $a_n = n^{1/(2\alpha'+d)}$ with $\alpha' \ne \alpha$, then the convergence rate of the nuisance parameter provided by
Theorem 5.1 becomes suboptimal: $\rho_n = n^{-\alpha'/(2\alpha'+d)}$ when $\alpha' < \alpha$ and $\rho_n = n^{-\alpha/(2\alpha'+d)}$ when $\alpha' > \alpha$
(up to logarithmic factors). Therefore, it is crucial to choose a proper nonparametric prior to obtain better frequentist
accuracy of the semiparametric Bayesian procedure. Finally, we remark that the remainder term
in the above fully Bayesian framework matches that derived in Cheng and Kosorok (2008a,
2009) with the nonparametric part profiled out. Note that the framework of Cheng and Kosorok (2008a,b, 2009) is
not fully Bayesian and does not cover the adaptive case. However, adaptiveness
can be easily incorporated into the construction of our nonparametric Bayesian prior, as will be
seen in the next section.
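The dependence of the remainder on the prior smoothness can be made concrete with a small calculation. This sketch ignores logarithmic factors and assumes the nuisance rate takes the form $n^{-\min(\alpha,\alpha')/(2\alpha'+d)}$ under the rescaling $a_n = n^{1/(2\alpha'+d)}$; the function names are hypothetical.

```python
def nuisance_rate_exponent(alpha, alpha_prior, d):
    """Exponent r in rho_n = n^{-r} (log factors ignored).

    alpha is the true smoothness; alpha_prior is the smoothness used to set
    the rescaling a_n = n^{1 / (2 * alpha_prior + d)}.
    """
    return min(alpha, alpha_prior) / (2.0 * alpha_prior + d)

def remainder_exponent(alpha, alpha_prior, d):
    """Exponent s in sqrt(n) * rho_n^2 = n^{-s} (log factors ignored)."""
    return 2.0 * nuisance_rate_exponent(alpha, alpha_prior, d) - 0.5

# Matched case: s = (2*alpha - d) / (2 * (2*alpha + d)), as in (5.2).
matched = remainder_exponent(alpha=2.0, alpha_prior=2.0, d=1)
# Mismatched cases give a smaller exponent, i.e. a slower-vanishing remainder.
undersmoothed = remainder_exponent(alpha=2.0, alpha_prior=1.0, d=1)
oversmoothed = remainder_exponent(alpha=2.0, alpha_prior=3.0, d=1)
```

In the matched case with $\alpha = 2$ and $d = 1$ the exponent is $3/10$, and both mismatched choices give a smaller exponent.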


5.1.2 Adaptive Bayesian Procedure 

In the adaptive case, we still specify a GP prior for $\Pi_H$. To allow adaptation to the unknown
smoothness $\alpha$, we follow van der Vaart and van Zanten (2009) by putting a prior on the inverse
bandwidth $A$. van der Vaart and van Zanten (2009) showed that the hierarchical prior

$$W^A \mid A \sim GP(0, K^A), \qquad A^d \sim Ga(a_0, b_0) \tag{5.3}$$

with $Ga(a_0, b_0)$ the Gamma distribution whose pdf $p(t) \propto t^{a_0-1}e^{-b_0 t}$ leads to the minimax rate
$n^{-\alpha/(2\alpha+d)}$ up to $\log n$ factors, adaptively over all smoothness levels $\alpha > 0$. Since the choice of
hyperparameters has a diminishing impact on the posterior distribution as the sample size $n$ grows, we
simply choose $a_0 = 1$ and $b_0 = 1$.
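A draw from the hierarchical prior (5.3) can be sketched as follows. This is a minimal illustration under the squared exponential kernel; the function name, the jitter term and the evaluation grid are assumptions of the example rather than part of the paper.

```python
import numpy as np

def sample_adaptive_gp(points, a0=1.0, b0=1.0, jitter=1e-6, seed=0):
    """Draw (A, W^A) with A^d ~ Ga(a0, b0) and W^A | A ~ GP(0, K^A).

    points has shape (m, d); jitter is a hypothetical numerical stabilizer.
    """
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    d = points.shape[1]
    # The pdf is proportional to t^(a0-1) exp(-b0 t), so b0 is a rate;
    # NumPy's gamma sampler takes shape and scale = 1 / rate.
    A = rng.gamma(shape=a0, scale=1.0 / b0) ** (1.0 / d)
    sq_dist = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    gram = np.exp(-(A ** 2) * sq_dist) + jitter * np.eye(len(points))
    path = np.linalg.cholesky(gram) @ rng.standard_normal(len(points))
    return A, path

grid = np.linspace(0.0, 1.0, 30).reshape(-1, 1)
A, path = sample_adaptive_gp(grid)
```

Because $A$ is random, the prior mixes over length scales, which is what allows the posterior to adapt to the unknown smoothness.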
Under the above choice for $\Pi_H$, Condition (A6) becomes overly stringent, suggesting the
inadequacy of the independent prior in the adaptive scenario. Please see Section A.1 in the Appendix
for further explanation. Fortunately, the dependent prior avoids (A6) by incorporating the bias
correction as follows:

$$\theta \sim \Pi_\Theta, \qquad A^d \sim Ga(a_0, b_0), \qquad \eta \mid \theta, A \sim GP(\theta^T\hat h_n,\ K^A). \tag{5.4}$$

The proof of Theorem 5.2 is omitted due to its similarity to those of Theorems 5.1 and A.1.

Theorem 5.2. Let $X_i = (U_i, V_i, Y_i)$, $i = 1,\ldots,n$, be a sample from the partially linear model.
Suppose that $\hat h_n$ is an estimator of the least favorable direction $h^*(\cdot) = -E[U \mid V = \cdot]$, and conditions
(i)-(iii) in Theorem 5.1 hold for the dependent prior. Then under the prior (5.4), the following
second order BvM result holds:

$$\sup_A\big|\Pi(\theta\in A\mid X^{(n)}) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}\big(n^{-\frac{2\alpha-d}{2(2\alpha+d)}}(\log n)^{2d+3}\big). \tag{5.5}$$

Note that the remainder term in Theorem 5.2 is exactly the same as that in Theorem 5.1.
However, we cannot claim that the adaptive procedure does not lead to any loss of the second
order Bayesian efficiency because the remainder term is not proven to be sharp.

5.1.3 Simulation Results 

In this section, we conduct a simulation study for comparing the dependent and independent prior 
in the adaptive scenario. In each setting, we generated 100 datasets from the following four models: 

M1  $Y_i = 0.5U_i + \exp(V_i) + N(0, 0.5^2)$, with $V_i \sim N(0,1)$ and $U_i \mid V_i \sim N(0.5|V_i|^3, 1)$;

M2  $Y_i = 0.5U_i + \exp(V_i) + N(0, 0.5^2)$, with $V_i \sim N(0,1)$ and $U_i \mid V_i \sim N(0.5V_i^3, 1)$;

M3  $Y_i = 0.5U_i + \exp(|V_i|) + N(0, 0.5^2)$, with $V_i \sim N(0,1)$ and $U_i \mid V_i \sim N(0.5|V_i|^3, 1)$;

M4  $Y_i = 0.5U_i + \exp(|V_i|) + N(0, 0.5^2)$, with $V_i \sim N(0,1)$ and $U_i \mid V_i \sim N(0.5V_i^3, 1)$.
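The four data-generating models can be sketched as a single sampler. The function name `simulate_plm` and the seeding scheme are assumptions of this example; only the model equations are taken from the text.

```python
import numpy as np

def simulate_plm(model, n, seed=0):
    """Generate (U, V, Y) from one of the models M1-M4 with theta_0 = 0.5."""
    if model not in ("M1", "M2", "M3", "M4"):
        raise ValueError("model must be one of M1, M2, M3, M4")
    rng = np.random.default_rng(seed)
    V = rng.standard_normal(n)
    # M2 and M4 use the smooth conditional mean 0.5 V^3; M1 and M3 use 0.5 |V|^3.
    mean_U = 0.5 * V ** 3 if model in ("M2", "M4") else 0.5 * np.abs(V) ** 3
    U = mean_U + rng.standard_normal(n)
    # M1 and M2 use the smooth nuisance exp(V); M3 and M4 use exp(|V|).
    eta = np.exp(V) if model in ("M1", "M2") else np.exp(np.abs(V))
    Y = 0.5 * U + eta + 0.5 * rng.standard_normal(n)
    return U, V, Y

U, V, Y = simulate_plm("M3", n=400, seed=42)
```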

In M1, the least favorable direction $h^*(v) = 0.5|v|^3$ is twice differentiable but not thrice differentiable
at $v = 0$. In contrast, the least favorable direction $h^*(v) = 0.5v^3$ in M2 is infinitely differentiable.
M3 and M4 are counterparts of M1 and M2 respectively, with nuisance parts that are non-differentiable at
$v = 0$. As for assigned priors, we consider three different setups: P1, the independent prior with $\Pi_H$
specified by (5.3); P2, the dependent prior (5.4) with an estimator $\hat h_n(v)$ produced by the
Nadaraya-Watson kernel regression method (see footnote 4); P3, the dependent prior (5.4) with $\hat h_n(v) = -E(U \mid V = v)$. In
each, we chose a vague prior $N(0, 10^2)$ as $\Pi_\Theta$, and hyperparameters $a_0 = b_0 = 1$. For each replicate,
we ran MCMC for 10,000 iterations and discarded the first 5,000 as burn-in.
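A Nadaraya-Watson estimate of $E[U \mid V = v]$ with a Gaussian kernel, as used in setup P2, might look as follows. This is a sketch rather than the authors' code: the default bandwidth below is the normal-reference rule $\hat\sigma\,(4/(3n))^{1/5}$, which is an assumption of this example and may differ from the bandwidth actually used.

```python
import numpy as np

def nadaraya_watson(v_train, u_train, v_eval, bandwidth=None):
    """Gaussian-kernel Nadaraya-Watson estimate of E[U | V = v] at v_eval."""
    v_train = np.asarray(v_train, dtype=float)
    u_train = np.asarray(u_train, dtype=float)
    v_eval = np.asarray(v_eval, dtype=float)
    if bandwidth is None:
        # Assumed default: normal-reference bandwidth.
        bandwidth = np.std(v_train, ddof=1) * (4.0 / (3.0 * len(v_train))) ** 0.2
    z = (v_eval[:, None] - v_train[None, :]) / bandwidth
    weights = np.exp(-0.5 * z ** 2)
    return (weights @ u_train) / weights.sum(axis=1)

rng = np.random.default_rng(3)
V = rng.standard_normal(1000)
U = 0.5 * V ** 3 + rng.standard_normal(1000)
# The shift in the dependent prior is minus the estimated regression function.
h_hat = -nadaraya_watson(V, U, V)
```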

The results for M1 and M2 are displayed in Table 1. We varied the sample size $n$ from 50 to 400
and applied the three priors P1, P2 and P3. We record the root mean squared error (RMSE) for $\theta$
(under the Euclidean norm) and $\eta$ (under the empirical norm), respectively, across 100 replicates.
The average estimated standard error based on MCMC (SE) and the empirical coverage of nominal
95% credible intervals based on MCMC (CR95) are also reported. From Table 1, we can see that
the estimation accuracy of $\theta$ (in terms of RMSE) improves under the dependent priors P2 and P3 as
$n$ grows. However, the RMSE for $\theta$ under the independent prior P1 only decreases noticeably as $n$
goes from 50 to 100, and remains around 0.1 thereafter. On the other hand, the estimated standard
errors produced by P1 - P3 are all very close. The CR95 results further illustrate the significant
under-coverage of the credible intervals produced by P1. All of these empirical observations confirm
the existence of semiparametric bias (discussed after Lemma 3.5), and illustrate the necessity of
compensating for this bias by using the dependent priors. Moreover, we observe that the RMSE for $\theta$
depends closely on that for $\eta$: a large RMSE for $\eta$ usually leads to a large RMSE for $\theta$, which is
consistent with our theory. For example, Bayesian estimation accuracy of $\theta$ is higher in M1 than in
M2. Another observation from Table 1 is that as $n$ increases, the difference in estimation accuracy
between P2 and P3 becomes negligible. This might be attributed to the increasing accuracy of the
estimator $\hat h_n$.

4 We apply the Gaussian kernel with an optimal bandwidth (Bowman and Azzalini, 1997, p. 31).



    n    model  method  RMSE($\theta$)  SE     RMSE($\eta$)  CR95
   50    M1     P1      0.115           0.078  0.308         0.92
   50    M1     P2      0.085           0.082  0.274         0.96
   50    M1     P3      0.083           0.083  0.270         0.95
   50    M2     P1      0.104           0.080  0.298         0.84
   50    M2     P2      0.084           0.085  0.267         0.96
   50    M2     P3      0.082           0.085  0.268         0.96
  100    M1     P1      0.103           0.052  0.225         0.83
  100    M1     P2      0.056           0.056  0.202         0.95
  100    M1     P3      0.053           0.056  0.204         0.96
  100    M2     P1      0.096           0.051  0.235         0.85
  100    M2     P2      0.055           0.054  0.209         0.94
  100    M2     P3      0.051           0.055  0.206         0.97
  200    M1     P1      0.106           0.038  0.230         0.62
  200    M1     P2      0.042           0.038  0.197         0.93
  200    M1     P3      0.036           0.038  0.187         0.97
  200    M2     P1      0.094           0.036  0.209         0.72
  200    M2     P2      0.038           0.038  0.180         0.95
  200    M2     P3      0.038           0.038  0.183         0.98
  400    M1     P1      0.115           0.035  0.289         0.38
  400    M1     P2      0.030           0.028  0.187         0.93
  400    M1     P3      0.025           0.028  0.187         0.98
  400    M2     P1      0.107           0.033  0.268         0.45
  400    M2     P2      0.030           0.027  0.178         0.92
  400    M2     P3      0.027           0.026  0.179         0.98

Table 1: Simulation results for the partially linear model with a smooth nuisance function based
on 100 replicates.
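The summary columns reported in the tables (RMSE, SE, CR95) can be computed from per-replicate posterior draws roughly as follows. This sketch assumes a scalar $\theta$, the posterior mean as point estimate, the posterior standard deviation as the MCMC standard error, and equal-tailed credible intervals; these choices and the function name are assumptions of the example, not statements about the authors' code.

```python
import numpy as np

def summarize_replicates(theta_draws, theta_true, level=0.95):
    """Summaries across replicates.

    theta_draws has shape (n_replicates, n_mcmc_draws) and holds post-burn-in
    posterior draws of a scalar parameter for each simulated dataset.
    """
    theta_draws = np.asarray(theta_draws, dtype=float)
    post_mean = theta_draws.mean(axis=1)
    post_sd = theta_draws.std(axis=1, ddof=1)
    lower, upper = np.quantile(
        theta_draws, [(1.0 - level) / 2.0, (1.0 + level) / 2.0], axis=1
    )
    covered = (lower <= theta_true) & (theta_true <= upper)
    return {
        "RMSE": float(np.sqrt(np.mean((post_mean - theta_true) ** 2))),
        "SE": float(post_sd.mean()),
        "coverage": float(covered.mean()),
    }

# Synthetic example: well-calibrated posteriors centered near theta_0 = 0.5.
rng = np.random.default_rng(0)
centers = 0.5 + 0.1 * rng.standard_normal(200)
draws = centers[:, None] + 0.1 * rng.standard_normal((200, 2000))
summary = summarize_replicates(draws, theta_true=0.5)
```

When RMSE is much larger than SE, as for P1 at large $n$, the intervals are centered in the wrong place and coverage falls well below the nominal level.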


Table 2 provides the results for M3 and M4, where the nuisance function is non-differentiable.
As expected, the overall RMSE in Table 2 is worse than that in Table 1. However, overall trends
similar to those in Table 1 are observed. For example, the estimation accuracies of P1 are generally
worse than those of P2 and P3, and the semiparametric bias in P1 is more salient under M3 and
M4 than under M1 and M2. In addition, the RMSE for $\theta$ produced by P1 under a non-smooth least
favorable direction $h^*$ is significantly worse than the RMSE under a smooth $h^*$. This is consistent
with condition (A7), because the semiparametric bias under the independent prior (PI) depends
on the smoothness of $h^*$.


    n    model  method  RMSE($\theta$)  SE     RMSE($\eta$)  CR95
   50    M3     P1      0.243           0.088  0.499         0.74
   50    M3     P2      0.090           0.085  0.279         0.94
   50    M3     P3      0.084           0.087  0.280         0.97
   50    M4     P1      0.194           0.089  0.408         0.80
   50    M4     P2      0.084           0.087  0.270         0.97
   50    M4     P3      0.084           0.087  0.265         0.97
  100    M3     P1      0.217           0.064  0.441         0.67
  100    M3     P2      0.061           0.056  0.233         0.93
  100    M3     P3      0.057           0.056  0.231         0.93
  100    M4     P1      0.122           0.052  0.309         0.84
  100    M4     P2      0.059           0.055  0.221         0.96
  100    M4     P3      0.058           0.055  0.219         0.95
  200    M3     P1      0.189           0.036  0.410         0.53
  200    M3     P2      0.042           0.039  0.215         0.94
  200    M3     P3      0.042           0.039  0.212         0.97
  200    M4     P1      0.106           0.042  0.271         0.77
  200    M4     P2      0.041           0.038  0.204         0.98
  200    M4     P3      0.040           0.038  0.203         0.97
  400    M3     P1      0.194           0.041  0.429         0.21
  400    M3     P2      0.035           0.029  0.207         0.95
  400    M3     P3      0.031           0.028  0.205         0.95
  400    M4     P1      0.115           0.033  0.282         0.65
  400    M4     P2      0.033           0.028  0.193         0.94
  400    M4     P3      0.030           0.028  0.193         0.96

Table 2: Simulation results for the partially linear model with a non-smooth nuisance function
based on 100 replicates.


5.2 Generalized Partially Linear Model (GPLM)

The semiparametric BvM results for GPLM are similar to those for PLM. Hence, we only focus
on the more challenging adaptive scenario in this section. In particular, we consider the same
dependent prior as in Section 5.1, i.e., a GP with a random inverse bandwidth parameter. Define


$$f(\xi) = \frac{dF(\xi)}{d\xi}, \qquad l(\xi) = \frac{f(\xi)}{V(F(\xi))}, \qquad \xi \in \mathbb R,$$

$f_0 = f(g_0)$ and $l_0 = l(g_0)$.

Theorem 5.3. Let $X_i = (U_i, V_i, Y_i)$, $i = 1,\ldots,n$, be a sample from GPLM satisfying
Assumption 3 in Section A.2. Suppose Conditions (i) - (iii) for the dependent prior in Theorem 5.1 hold.
Moreover, assume

(ii'). The information matrix $I_0 = E_0[l_0(T)f_0(T)(U + h^*(V))(U + h^*(V))^T]$ and the identification
matrix $P_0(U - E[U\mid V])(U - E[U\mid V])^T$ are invertible.

Then under the dependent prior (5.4), the following second order BvM result holds:

$$\sup_A\big|\Pi(\theta\in A\mid X^{(n)}) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}\big(n^{-\frac{2\alpha-d}{2(2\alpha+d)}}(\log n)^{2d+3}\big),$$

where $\tilde\Delta_n = n^{-1/2}\sum_{i=1}^n I_0^{-1}\,(Y_i - F(g_0(T_i)))\,l_0(T_i)\,(U_i + h^*(V_i)) \rightsquigarrow N(0, I_0^{-1})$.


5.3 Cox Proportional Hazard Model 

In this section, we revisit the Cox proportional hazards model with current status data in Section 2.3.
Recall that we use notation $\theta$ and $\Lambda$ to denote the parametric part and nuisance part in the model,
respectively, and that the least favorable direction $h^*$ is given by

$$h^*(c) = \frac{E[ZQ^2_{\theta_0,\Lambda_0}(X) \mid C = c]}{E[Q^2_{\theta_0,\Lambda_0}(X) \mid C = c]}. \tag{5.6}$$


Assume that the true baseline hazard function $\lambda_0$ is Lipschitz continuous and uniformly bounded
away from zero. Then, by reparametrizing $\log\lambda$ as the nuisance function $\eta$, we assign the following
Riemann-Liouville type prior $\Pi_H$ (van der Vaart and van Zanten, 2008a, Section 4.2):

$$\eta(t) = \int_0^t (t-u)^{1/2}\,dW_u + \sum_{k=0}^{2} Z_k t^k, \qquad 0 \le t \le T, \tag{5.7}$$

where $Z_k \stackrel{iid}{\sim} N(0,1)$, $k = 0,1,2$. The prior $\Pi_\Theta$ can be chosen as any distribution with positive pdf
everywhere over $\mathbb R^p$.
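A path from a Riemann-Liouville type prior can be approximated on a grid by discretizing the stochastic integral. The sketch below assumes the form $\eta(t) = \int_0^t (t-u)^{1/2}\,dW_u + \sum_{k=0}^{2} Z_k t^k$ with $W$ a standard Brownian motion; the midpoint discretization, the function name and the grid sizes are assumptions of this example.

```python
import numpy as np

def sample_riemann_liouville(t_grid, n_fine=2000, seed=0):
    """Approximate one prior path at the times in t_grid (max(t_grid) > 0)."""
    rng = np.random.default_rng(seed)
    t_grid = np.asarray(t_grid, dtype=float)
    T = t_grid.max()
    du = T / n_fine
    u = (np.arange(n_fine) + 0.5) * du             # midpoints of the fine cells
    dW = rng.standard_normal(n_fine) * np.sqrt(du)  # Brownian increments
    Z = rng.standard_normal(3)                      # Z_0, Z_1, Z_2
    # (t - u)^{1/2} for u < t, and 0 for u >= t.
    kernel = np.sqrt(np.clip(t_grid[:, None] - u[None, :], 0.0, None))
    stochastic_part = kernel @ dW
    polynomial_part = Z[0] + Z[1] * t_grid + Z[2] * t_grid ** 2
    return stochastic_part + polynomial_part, Z

path, Z = sample_riemann_liouville(np.linspace(0.0, 1.0, 11))
```

At $t = 0$ the stochastic integral vanishes, so the path starts at $Z_0$; the polynomial part keeps the prior from being pinned at zero there.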

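Theorem 5.4. Let $X_i = (C_i, \delta_i, Z_i)$, $i = 1,\ldots,n$, be a sample from the Cox model with current
status data. Assume that $h^*$ given by (5.6) is Lipschitz continuous and $\lambda_0$ satisfies $\lambda_0(t) \le M$ for
some constant $M$. Then under the independent prior $\Pi_\Theta \times \Pi_H$, the following second order BvM
result holds:

$$\sup_A\big|\Pi(\theta\in A\mid X^{(n)}) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_{\theta_0,\Lambda_0})^{-1}\big)(A)\big| = O_{P_0}(\sqrt n\,\rho_n^2),$$

where $\rho_n = n^{-1/3}$, $\tilde\Delta_n = n^{-1/2}\sum_{i=1}^n I^{-1}_{\theta_0,\Lambda_0}\big(Z_i\Lambda_0(C_i) + h^*(C_i)\big)Q_{\theta_0,\Lambda_0}(X_i) \rightsquigarrow N_p(0, I^{-1}_{\theta_0,\Lambda_0})$ and the
information matrix $I_{\theta_0,\Lambda_0} = E_0\big[(Z\Lambda_0(C) + h^*(C))(Z\Lambda_0(C) + h^*(C))^T Q^2_{\theta_0,\Lambda_0}(X)\big]$.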
APPENDIX


A.l Independent prior for the adaptive procedure 


In this section, we explain why independent priors are not suitable for adaptive Bayesian procedures.
In short, we need to impose a very stringent condition on the least favorable direction in this case.
We focus on the same GP prior as in van der Vaart and van Zanten (2009), but slightly modify the
prior for the inverse bandwidth $A$ to be a truncated $Ga(a_0, b_0)$ whose pdf $p(t) \propto t^{a_0-1}e^{-b_0 t}I(t \ge t_0)$.
This truncation is introduced for technical simplicity and does not sacrifice the adaptivity of the
prior. The following result is an adaptive version of Theorem 5.1 under the independent prior (PI).


Theorem A.1. Let $X_i = (U_i, V_i, Y_i) \in \mathbb R^p \times \mathbb R^d \times \mathbb R$, $i = 1,\ldots,n$, be $n$ observations from the
partially linear model (2.3). Consider the independent prior (PI) with the above adaptive $\Pi_H$.
Assume the following conditions:

(i). $\eta_0$ is Hölder $\alpha$-smooth, where $\alpha > d/2$;

(ii). The information matrix $I_0 = P_0(U - E[U\mid V])^{\otimes 2}$ is invertible;

(iii). The least favorable direction $h^*(v) = E[U \mid V = v]$ belongs to the RKHS $\mathbb H^{t_0}$, where $t_0$ is the
truncation parameter in the above truncated $Ga(a_0, b_0)$.

Then the following second order BvM theorem holds:

$$\sup_A\big|\Pi(\theta\in A\mid X^{(n)}) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}(\sqrt n\,\rho_n^2\log n) = O_{P_0}\big(n^{-\frac{2\alpha-d}{2(2\alpha+d)}}(\log n)^{2d+3}\big).$$
A 


We point out that this theorem requires a strong constraint on the least favorable direction
$h^*(v)$, i.e., Condition (iii). In fact, a sufficient condition for a function $f$ to belong to $\mathbb H^{t_0}$ is that it
has a Fourier transform $\hat f$ satisfying

$$\int_{\mathbb R^d} |\hat f(\lambda)|^2 e^{c|\lambda|^2/t_0}\,d\lambda < \infty,$$

for some $c > 0$ (van der Vaart and van Zanten, 2009). This condition implies the infinite
differentiability of $h^*$ and imposes a strong restriction; for example, it fails even for constant and
polynomial functions. On the other hand, we have to admit that Condition (iii) is only a sufficient
condition, although our empirical results indicate that a similar condition is necessary.


A.2 GPLM: Assumptions and LFS Lemma 

Assumption 3. (a) There exists some positive constant $C_0$ such that $E_0(\exp(t|W|/C_0) \mid T) \le C_0 e^{C_0 t^2}$
for all $t > 0$, i.e., $W = Y - m_0(T)$ is sub-Gaussian.

(b) There exist positive constants $C_1$, $C_2$, $C_3$ and $C_4$ such that: 1. $1/C_1 \le V(s) \le C_1$ for all
$s \in F(\mathbb R)$; 2. $1/C_2 \le |l(\xi)| \le C_2$ for all $\xi \in \mathbb R$; 3. $|l(\xi) - l(\xi_0)| \le C_3|\xi - \xi_0|$ for all $|\xi - \xi_0| \le \eta_0$;
4. $|f(\xi) - f(\xi_0)| \le C_4|\xi - \xi_0|$ for all $|\xi - \xi_0| \le \eta_0$.

The assumption that $V$ and $l$ are both bounded could be restrictive and can be removed in
many cases, such as the binary logistic regression model, by applying empirical process arguments
similar to those in Section 7 of Mammen and van de Geer (1997). Under Assumption 3(b), the
following lemma describes the least favorable curve for the class of GPLM.
Lemma A.1. Suppose Assumption 3(b) is met. Then the least favorable curve $\eta^*(\theta)$, defined
as the maximizer over $\eta$ of

$$E_0\log(p_{\theta,\eta}/p_{\theta_0,\eta_0}) = E_0\int_{m_{\theta_0,\eta_0}(T)}^{m_{\theta,\eta}(T)}\frac{Y - s}{V(s)}\,ds = E_0\int_{m_{\theta_0,\eta_0}(T)}^{m_{\theta,\eta}(T)}\frac{m_{\theta_0,\eta_0}(T) - s}{V(s)}\,ds$$

as a function of $\theta$, takes the following expression

$$\eta^*(\theta) = \eta_0 + (\theta-\theta_0)^T h^*(V) + O(|\theta-\theta_0|^2), \quad\text{as } |\theta-\theta_0| \to 0, \tag{A.1}$$

with

$$h^*(v) = \frac{E_0[Uf_0(T)l_0(T) \mid V = v]}{E_0[f_0(T)l_0(T) \mid V = v]}. \tag{A.2}$$

Equation (A.1) provides a local expansion of the least favorable curve defined in (2.2), which is
enough for our purpose, since the posterior of $\theta$ is expected to concentrate in a $\sqrt n$-neighborhood
of $\theta_0$.


Proof of Lemma A.1. By Assumption 3(b), for any $(\theta,\eta)$, we have


Eq log (pe, v /pe 0 ,vo) < - 1 E 0 (m e , v (T) - rre 0 OiT?o (T )) 2 

<-{ClC 2 )- l Eo\ge )T1 (T)-go(T)\ 2 
< - 2(C' 1 2 C ' 2 )- 1 (|6 - 0 O \ 2 + E 0 \r, - g 0 \ 2 ), 


where the first line follows since $V(s) \le C_1$, the second line follows by the fact that $|f(\xi)| =
|l(\xi)| \cdot |V(F(\xi))| \in [1/(C_1C_2),\ C_1C_2]$ and the third line follows by the assumption that $U \in [0,1]^p$.
Similarly, we have


Eo logO Pe, v / P e 0 ,vo) > ~ CfC 2 E o {(0 - 0 O ) T U + 77 (C) - 770 (E)) 2 . 
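Let $\tilde\eta(\theta)(v) = \eta_0(v) - (\theta-\theta_0)^T E[U \mid V = v]$. Then by definition of $\eta^*(\theta)$, we have

$$E_0\log(p_{\theta,\eta^*(\theta)}/p_{\theta_0,\eta_0}) \ge E_0\log(p_{\theta,\tilde\eta(\theta)}/p_{\theta_0,\eta_0}).$$

Combining the above inequalities, we obtain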

- 2(C 2 C 2 )- 1 (|0 - 0 O | 2 + E 0 \ v *(6) - 7? 0 1 2 ) 

> - C 2 C 2 E 0 {(9 - 0 O ) T U + fj(9)(V) - 770 (E )) 2 
= - CfC 2 E 0 (U - E[U\V}) 2 \0 - 0o| 2 , 


which implies

$$\eta^*(\theta) - \eta_0 = O(|\theta-\theta_0|). \tag{A.3}$$

For an arbitrary function $h(V) : \mathbb R^d \to \mathbb R$ with $\|h\|_\infty < \infty$, consider

$$g_{\theta,\eta^*(\theta),t} = g_{\theta,\eta^*(\theta)} + th,$$

for $t$ in a neighborhood of 0. The optimality of $g_{\theta,\eta^*(\theta)}$ implies that

$$0 = E_0\big[(Y - F(g_{\theta,\eta^*(\theta)}))\,l(g_{\theta,\eta^*(\theta)})\,h(V)\big] = E_0\big\{E_0\big[(F(g_{\theta_0,\eta_0}) - F(g_{\theta,\eta^*(\theta)}))\,l(g_{\theta,\eta^*(\theta)}) \mid V\big]\,h(V)\big\}.$$

Since the above equality holds for any $h$, we have

$$E_0\big[(F(g_{\theta_0,\eta_0}) - F(g_{\theta,\eta^*(\theta)}))\,l(g_{\theta,\eta^*(\theta)}) \mid V = v\big] = 0, \quad\text{a.s.}$$

The last display, equation (A.3) and Assumption 3(b) together imply

$$E_0[f_0(T)l_0(T)(\theta-\theta_0)^T U \mid V = v] + E_0[f_0(T)l_0(T) \mid V = v]\,(\eta^*(\theta) - \eta_0)(v) = O(|\theta-\theta_0|^2), \quad\text{a.s.}$$

This gives us

$$\eta^*(\theta)(v) = \eta_0(v) - (\theta-\theta_0)^T h^*(v) + O(|\theta-\theta_0|^2), \quad\text{as } |\theta-\theta_0| \to 0,$$

with $h^*$ defined by (A.2). □


A.3 Proof of Theorem 3.1


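Let $B_n := \{|\theta-\theta_0| \le M\epsilon_n,\ \eta \in \mathcal H_n\}$, where $M$ is a sufficiently large constant. Then we have
$\Pi(B_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n)$ for $M \ge 1$ by Assumption 1. For any measurable $A \subset \Theta$,

$$\big|\Pi(\theta\in A \mid X^{(n)}, B_n) - \Pi(\theta\in A \mid X^{(n)})\big| = \frac{\big|\Pi(\theta\in A, B_n^c \mid X^{(n)}) - \Pi(\theta\in A \mid X^{(n)})\,[1 - \Pi(B_n \mid X^{(n)})]\big|}{\Pi(B_n \mid X^{(n)})}$$
$$\le \frac{2\,|1 - \Pi(B_n \mid X^{(n)})|}{\Pi(B_n \mid X^{(n)})} = O_{P_0}(\delta_n).$$

Taking the supremum over all measurable subsets $A$ of $\Theta$, we obtain

$$\sup_A\big|\Pi(\theta\in A \mid X^{(n)}, B_n) - \Pi(\theta\in A \mid X^{(n)})\big| = O_{P_0}(\delta_n).$$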
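The conditioning bound used above, $|\Pi(A \mid B) - \Pi(A)| \le 2(1 - \Pi(B))/\Pi(B)$, holds for any probability measure and can be checked numerically on a toy discrete distribution. The setup below (200 atoms, random events) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(200))   # a discrete probability measure on 200 atoms
in_B = rng.random(200) < 0.95         # a high-probability conditioning event B
in_B[0] = True                        # make sure B is nonempty
p_B = probs[in_B].sum()

worst_gap = 0.0
for _ in range(1000):
    in_A = rng.random(200) < rng.random()          # a random event A
    p_A = probs[in_A].sum()
    p_A_given_B = probs[in_A & in_B].sum() / p_B
    worst_gap = max(worst_gap, abs(p_A_given_B - p_A))

bound = 2.0 * (1.0 - p_B) / p_B       # right-hand side of the inequality
```

Across all sampled events, `worst_gap` stays below `bound`, mirroring the step that lets the proof restrict attention to the localized posterior.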


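Therefore, it remains to show that

$$\sup_A\big|\Pi(\theta\in A \mid X_1,\ldots,X_n, B_n) - N_p\big(\theta_0 + n^{-1/2}\tilde\Delta_n,\ (nI_0)^{-1}\big)(A)\big| = O_{P_0}[R_n(n^{-1/2}\log n)], \tag{A.4}$$

where

$$\Pi(\theta\in A \mid X_1,\ldots,X_n, B_n) = \int_{A\cap\{|\theta-\theta_0|\le M\epsilon_n\}}\frac{S_n(\theta)}{S_n(\theta_0)}\,d\Pi_\Theta(\theta)\Big/\int_{|\theta-\theta_0|\le M\epsilon_n}\frac{S_n(\theta)}{S_n(\theta_0)}\,d\Pi_\Theta(\theta). \tag{A.5}$$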

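Recall the definition of $\tilde\Delta_n$ by (1.2). Since the pdf of a normally distributed random variable
with mean $\theta_0 + n^{-1/2}\tilde\Delta_n$ and covariance matrix $(nI_0)^{-1}$ evaluated at $\theta$ is proportional to

$$\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\},$$

it suffices to prove

$$\Big|\int_{A\cap\{|\theta-\theta_0|\le M\epsilon_n\}}\frac{S_n(\theta)}{S_n(\theta_0)}\,d\Pi_\Theta(\theta) - \int_A\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta\Big|$$
$$= O_{P_0}[R_n(n^{-1/2}\log n)]\int_\Theta\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta. \tag{A.6}$$

In fact, one can apply the above equation with $A = A$ and $A = \Theta$ respectively, and then simple
algebra leads to (A.4).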
Since $n\epsilon_n^2 \ge -\log R_n(n^{-1/2}\log n) \to \infty$ and $\sum_{i=1}^n\tilde\ell_0(X_i) = O_{P_0}(\sqrt n)$, by choosing $M$ sufficiently
large we have

$$\int_{A\cap\{|\theta-\theta_0| > M\epsilon_n\}}\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta$$
$$= O_{P_0}[R_n(n^{-1/2}\log n)]\int_\Theta\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta. \tag{A.7}$$


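By a subsequence argument, Assumption 2 implies

$$\sup_{|\theta-\theta_0|\le M\epsilon_n}\Big|\log\frac{S_n(\theta)}{S_n(\theta_0)} - (\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) + \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big|\Big/R_n(|\theta-\theta_0|) = O_{P_0}(1). \tag{A.8}$$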


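For every $\theta$ such that $|\theta-\theta_0| \le Mn^{-1/2}\log n$ with $M$ sufficiently large, the above analysis
implies that

$$\Big|\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\} - \frac{S_n(\theta)}{S_n(\theta_0)}\Big|$$
$$\le \exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,\Big|\exp\big\{O_{P_0}[R_n(n^{-1/2}\log n)]\big\} - 1\Big|$$
$$= O_{P_0}[R_n(n^{-1/2}\log n)]\,\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}, \tag{A.9}$$

where the last step follows since $R_n(n^{-1/2}\log n) \to 0$.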

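For every $\theta$ such that $Mn^{-1/2}\log n \le |\theta-\theta_0| \le M\epsilon_n$ with $M$ sufficiently large, we have by
Assumption 2 and the invertibility of $I_0$ that $R_n(|\theta-\theta_0|)/[n(\theta-\theta_0)^T I_0(\theta-\theta_0)] = o(1)$. Combining
this fact and the last display, we obtain

$$\Big|\int_{A\cap\{Mn^{-1/2}\log n\le|\theta-\theta_0|\le M\epsilon_n\}}\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta - \int_{A\cap\{Mn^{-1/2}\log n\le|\theta-\theta_0|\le M\epsilon_n\}}\frac{S_n(\theta)}{S_n(\theta_0)}\,d\Pi_\Theta(\theta)\Big|$$
$$= O_{P_0}(1)\int_{|\theta-\theta_0|\ge Mn^{-1/2}\log n}\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n4(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta \tag{A.10}$$
$$= O_{P_0}\big(e^{-Mc\log^2 n}\big)\int_\Theta\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta$$
$$= O_{P_0}[R_n(n^{-1/2}\log n)]\int_\Theta\exp\Big\{(\theta-\theta_0)^T\sum_{i=1}^n\tilde\ell_0(X_i) - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0)\Big\}\,d\theta,$$

for $M$ sufficiently large, where $c > 0$ is a constant only depending on $I_0$ and the last step follows
by the fact that $\int\exp\{at - bt^2\}\,dt \asymp b^{-1/2}$ for $b \gtrsim \min(a, 1)$.

Finally, (A.7) together with (A.9) and (A.10) implies (A.6).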
A.4 Proof of Corollary 3.2

For each $s = 1,\ldots,p$, taking $A = \mathbb R \times \cdots \times A_s \times \cdots \times \mathbb R$ in (3.6), where the $s$-th component is $A_s$
and the rest are $\mathbb R$, we obtain

$$\sup_{A_s\subset\mathbb R}\big|\Pi(\theta_s\in A_s \mid X_1,\ldots,X_n) - N\big(\theta_{0,s} + n^{-1/2}\tilde\Delta_{n,s},\ n^{-1}I_0^{ss}\big)(A_s)\big| = O_{P_0}(S_n), \tag{A.11}$$

where $\tilde\Delta_{n,s}$ is the $s$-th component of $\tilde\Delta_n$ and $I_0^{ss}$ the $(s,s)$-th element of the matrix $I_0^{-1}$. Let $\hat\theta_s$ be
the median of the marginal posterior distribution of $\theta_s$. Then taking $A_s = (-\infty, \hat\theta_s)$ in the above
formula yields

$$\big|\Phi\big(n^{1/2}(I_0^{ss})^{-1/2}(\hat\theta_s - \theta_{0,s} - n^{-1/2}\tilde\Delta_{n,s})\big) - 1/2\big| = O_{P_0}(S_n),$$

where $\Phi$ is the cdf of the standard normal distribution. By the continuity of $\Phi^{-1}$, we have

$$n^{1/2}(I_0^{ss})^{-1/2}(\hat\theta_s - \theta_{0,s} - n^{-1/2}\tilde\Delta_{n,s}) = O_{P_0}(S_n),$$

which proves the claimed result.

A.5 Proof of Corollary 3.3

Recall that $I_0^{ss}$ is the $(s,s)$-th element of $I_0^{-1}$. By choosing $A_s = (-\infty, q_{s,\alpha})$ in (A.11) and the
definition of $q_{s,\alpha}$, we have

$$\big|\Phi\big(n^{1/2}(I_0^{ss})^{-1/2}(q_{s,\alpha} - \theta_{0,s} - n^{-1/2}\tilde\Delta_{n,s})\big) - \alpha\big| = O_{P_0}(S_n),$$

which implies $q_{s,\alpha} = \theta_{0,s} + n^{-1/2}\tilde\Delta_{n,s} + n^{-1/2}(I_0^{ss})^{1/2}z_\alpha + n^{-1/2}O_{P_0}(S_n)$, where $z_\alpha$ denotes the $\alpha$-th
quantile of a standard normal distribution. This completes the proof of the claimed result.
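The leading terms of the quantile expansion can be evaluated directly. The sketch below drops the remainder term and uses hypothetical argument names for $\theta_{0,s}$, the $s$-th component of the centering statistic, and the $(s,s)$-th element of the inverse information matrix.

```python
import math
from statistics import NormalDist

def approx_posterior_quantile(theta0_s, delta_ns, info_inv_ss, n, alpha):
    """Leading-order approximation of the alpha-th marginal posterior quantile:
    theta0_s + n^{-1/2} * delta_ns + n^{-1/2} * sqrt(info_inv_ss) * z_alpha.
    """
    z_alpha = NormalDist().inv_cdf(alpha)
    return theta0_s + delta_ns / math.sqrt(n) + math.sqrt(info_inv_ss / n) * z_alpha

# Illustrative numbers only: a 95% equal-tailed interval from the expansion.
lower = approx_posterior_quantile(0.5, 0.2, 4.0, 100, 0.025)
upper = approx_posterior_quantile(0.5, 0.2, 4.0, 100, 0.975)
median = approx_posterior_quantile(0.5, 0.2, 4.0, 100, 0.5)
```

The interval is symmetric about the approximate posterior median, and its half-width shrinks at the rate $n^{-1/2}$.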

A.6 Proof of Lemma 3.5

With the definition of $S_n$ and the conditions in the lemma, we have

$$S_n(\theta) = \int_{\mathcal H_n}\exp\{l_n(\theta,\eta) - l_n(\theta_0,\eta_0)\}\,d\Pi_H^\theta(\eta)$$
$$= \exp\Big\{\sqrt n(\theta-\theta_0)^T g_n - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0) + O_{P_0}[G_n(|\theta-\theta_0|)]\Big\}\int_{\mathcal H_n}\exp\{l_n(\theta_0,\eta-\Delta\eta(\theta)) - l_n(\theta_0,\eta_0)\}\,d\Pi_H^\theta(\eta)$$
$$= \exp\Big\{\sqrt n(\theta-\theta_0)^T g_n - \frac n2(\theta-\theta_0)^T I_0(\theta-\theta_0) + O_{P_0}[R_n(|\theta-\theta_0|)]\Big\}\,S_n(\theta_0),$$

where the second line follows by condition (A1) and the last step follows by condition (A2). Finally,
the ILAN in Assumption 2 follows by taking logarithms of both sides of the above equality.
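A.7 Proof of Lemma 4.1

Under (A5), we have

$$\frac{\int_{\mathcal H_n-(\theta_n-\theta_0)^Th_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)}{\int_{\mathcal H_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)} = \frac{\int_{\mathcal H_n-(\theta_n-\theta_0)^Th_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)\big/\int_{\mathcal H}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)}{\int_{\mathcal H_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)\big/\int_{\mathcal H}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)}$$
$$= \frac{\Pi_H(\mathcal H_n-(\theta_n-\theta_0)^Th_n \mid X_1,\ldots,X_n)}{\Pi_H(\mathcal H_n \mid X_1,\ldots,X_n)} = 1 + O_{P_0}(\delta_n). \tag{A.12}$$

By applying a change of variables $\tilde\eta = \eta - (\theta_n-\theta_0)^Th_n$ in the numerator in (A2) and using (A4),
we can obtain

$$\int_{\mathcal H_n}e^{l_n(\theta_0,\eta-\Delta\eta(\theta_n))}\,d\Pi_H(\eta) = \int_{\mathcal H_n-(\theta_n-\theta_0)^Th_n}e^{l_n(\theta_0,\tilde\eta)}f_n(\tilde\eta)\,d\Pi_H(\tilde\eta)\cdot\{1 + O_{P_0}[\bar G_n(|\theta_n-\theta_0|)]\},$$

which combined with (A6) yields

$$\frac{\int_{\mathcal H_n}e^{l_n(\theta_0,\eta-\Delta\eta(\theta_n))}\,d\Pi_H(\eta)}{\int_{\mathcal H_n-(\theta_n-\theta_0)^Th_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H(\eta)} = 1 + O_{P_0}[\bar G_n(|\theta_n-\theta_0|)].$$

Finally, combining (A.12) with the last display implies (A2).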

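A.8 Proof of Lemma 4.2

Applying a change of variables $\tilde\eta = \eta - (\theta_n-\theta_0)^T\hat h_n$, we obtain

$$\int_{\mathcal H_n}e^{l_n(\theta_0,\eta-\Delta\eta(\theta_n))}\,d\Pi_H^{\theta_n}(\eta)$$
$$= \int_{\mathcal H_n-(\theta_n-\theta_0)^T\hat h_n}e^{l_n(\theta_0,\tilde\eta-\Delta\eta(\theta_n)+(\theta_n-\theta_0)^T\hat h_n)}\,d\Pi^{\theta_n}_{H,-(\theta_n-\theta_0)^T\hat h_n}(\tilde\eta)$$
$$= \int_{\mathcal H_n-(\theta_n-\theta_0)^T\hat h_n}e^{l_n(\theta_0,\tilde\eta-\Delta\eta(\theta_n)+(\theta_n-\theta_0)^T\hat h_n)}\,d\Pi_H^{\theta_0}(\tilde\eta)$$
$$= \big(1 + O_{P_0}[\bar G_n(\max\{|\theta_n-\theta_0|,\ n^{-1/2}\log n\})]\big)\int_{\mathcal H_n-(\theta_n-\theta_0)^T\hat h_n}e^{l_n(\theta_0,\tilde\eta)}\,d\Pi_H^{\theta_0}(\tilde\eta) \tag{A.13}$$
$$= \big(1 + O_{P_0}[\bar G_n(\max\{|\theta_n-\theta_0|,\ n^{-1/2}\log n\})]\big)\int_{\mathcal H_n}e^{l_n(\theta_0,\eta)}\,d\Pi_H^{\theta_0}(\eta),$$

where the second step follows by the definition of the prior (PD), the third step by (A4), and the
last step by (A.12).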

A.9 Proof of Theorem 5.1

For readers' convenience, we state the maximal inequality for sub-Gaussian random variables (van der Vaart and Wellner, 1996, Corollary 2.2.8), which is extensively applied in our examples.

Lemma A.2. Let $\{W_t : t \in T\}$ be a separable sub-Gaussian process and $d$ be a semimetric on the index set $T$ defined by $d(s,t) = \sigma(W_s - W_t)$. Then for every $\delta > 0$ and $x > 0$,

\[
P\Big(\sup_{d(s,t)\le\delta} |W_s - W_t| > x\Big) \le 2\exp\Big\{-x^2 \Big/ \Big(K\int_0^{\delta} \sqrt{\log N(\epsilon, T, d)}\, d\epsilon\Big)^2\Big\}, \tag{A.14}
\]

\[
E \sup_{d(s,t)\le\delta} |W_s - W_t| \le K\int_0^{\delta} \sqrt{\log N(\epsilon, T, d)}\, d\epsilon, \tag{A.15}
\]

for a universal constant $K$.
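The entropy integral on the right hand side of (A.14) and (A.15) can be explored numerically. The following is a minimal sketch, assuming a hypothetical entropy profile $\log N(\epsilon) = k\,\epsilon^{-d/\alpha}$ (the order of the Hölder-ball entropy used later in this proof); the function names, the constant `k`, and the midpoint rule are illustrative choices and are not taken from the paper. It shows that the integral is finite exactly when $\alpha > d/2$.

```python
import math

def entropy_integral(log_covering, delta, steps=20000):
    """Midpoint-rule approximation of the integral of sqrt(log N(eps)) over (0, delta]."""
    h = delta / steps
    total = 0.0
    for i in range(steps):
        eps = (i + 0.5) * h  # midpoint of the i-th cell, never exactly 0
        total += math.sqrt(max(log_covering(eps), 0.0)) * h
    return total

def polynomial_log_covering(alpha, d, k=1.0):
    """Hypothetical entropy profile log N(eps) = k * eps**(-d/alpha)."""
    return lambda eps: k * eps ** (-d / alpha)

def closed_form(alpha, d, delta, k=1.0):
    """Exact integral for the profile above; finite only when alpha > d/2."""
    if alpha <= d / 2:
        return math.inf
    power = 1.0 - d / (2.0 * alpha)
    return math.sqrt(k) * delta ** power / power

alpha, d, delta = 2.0, 1, 0.5
numeric = entropy_integral(polynomial_log_covering(alpha, d), delta)
exact = closed_form(alpha, d, delta)
print(f"numeric={numeric:.5f} exact={exact:.5f}")
```

The integrand behaves like $\epsilon^{-d/(2\alpha)}$ near zero, which is why the closed form returns infinity once $\alpha \le d/2$.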

We consider the independent prior and the dependent prior separately. 


Independent prior: Verification of Assumption 1. We apply Lemma 3.4 here. Let $\mathbb N$ denote the set of natural numbers and $\mathbb N_0 = \mathbb N \cup \{0\}$. For any $d$-dimensional multi-index $a = (a_1,\ldots,a_d) \in \mathbb N_0^d$, define $|a| = a_1 + \cdots + a_d$ and let $D^a$ denote the mixed partial derivative operator $\partial^{|a|}/\partial x_1^{a_1}\cdots\partial x_d^{a_d}$. For any real number $b$, let $\lfloor b \rfloor$ denote the largest integer strictly smaller than $b$. The Hölder class $C^{\gamma}([0,1]^d)$ is defined as the set of all $d$-variate $k = \lfloor\gamma\rfloor$ times differentiable functions $f$ on $[0,1]^d$ such that

\[
\|f\|_{C^{\gamma}} = \max_{|m|\le k}\, \sup_{x\in[0,1]^d} |D^m f(x)| + \max_{|m|=k}\, \sup_{x\ne y} \frac{|D^m f(x) - D^m f(y)|}{|x-y|^{\gamma-k}} < \infty.
\]

We use $C^{\gamma}_1$ to denote the unit ball in $C^{\gamma}$ under the norm $\|\cdot\|_{C^{\gamma}}$.

We choose the sieve $\mathcal F_n$ as $\mathcal F_n^{\theta} \oplus \mathcal F_n^{\eta}$ with

\[
\mathcal F_n^{\theta} = [-c\sqrt{n},\, c\sqrt{n}]^p \quad\text{and}\quad \mathcal F_n^{\eta} = \rho_n C^{\alpha}_1 + M_n \mathbb H_1^{a_n}, \tag{A.16}
\]

with $c$ a constant sufficiently large, $\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{d+1}$, $a_n = n^{1/(2\alpha+d)}$, and $M_n$ some constant to be determined later. The second term in the sieve construction for $\eta$ borrows the ideas from van der Vaart and van Zanten (2008a) and the first term $\rho_n C^{\alpha}_1$ from de Jonge and van Zanten (201?). We remark that in van der Vaart and van Zanten (2008a) the first term in their sieve construction ($B_n$ on page 20) is a multiple of $\mathbb B_1 := \{f \in L^2([0,1]^d) : \|f\|_{\infty} \le 1\}$, causing the functions in $\mathcal F_n^{\eta}$ to be non-differentiable. As a consequence, the $\epsilon$-covering entropy of their sieve cannot be properly bounded when $\epsilon < \rho_n$ as in our proof (see (A.20) below).
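To get a feel for the sieve parameters, the rate $\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{d+1}$ and the inverse bandwidth $a_n = n^{1/(2\alpha+d)}$ can be tabulated. This is a small illustrative sketch; the function names and the sample values of $n$, $\alpha$ and $d$ are assumptions made here for the example. It also checks the algebraic identity $a_n^d = n\rho_n^2(\log n)^{-2(d+1)}$, which follows directly from the two definitions.

```python
import math

def contraction_rate(n, alpha, d):
    """rho_n = n^{-alpha/(2 alpha + d)} * (log n)^{d+1}."""
    return n ** (-alpha / (2 * alpha + d)) * math.log(n) ** (d + 1)

def inverse_bandwidth(n, alpha, d):
    """a_n = n^{1/(2 alpha + d)}."""
    return n ** (1.0 / (2 * alpha + d))

def identity_gap(n, alpha, d):
    """Relative gap between a_n^d and n * rho_n^2 * (log n)^{-2(d+1)}."""
    lhs = inverse_bandwidth(n, alpha, d) ** d
    rhs = n * contraction_rate(n, alpha, d) ** 2 / math.log(n) ** (2 * (d + 1))
    return abs(lhs - rhs) / lhs

alpha, d = 2.0, 1  # illustrative smoothness and dimension
for n in (10**4, 10**6, 10**8):
    rho = contraction_rate(n, alpha, d)
    a_n = inverse_bandwidth(n, alpha, d)
    print(f"n={n:>9d} rho_n={rho:.4f} a_n={a_n:.2f} n*rho_n^2={n * rho * rho:.1f}")
```

Because of the logarithmic factor, $\rho_n$ decreases slowly for moderate $n$; the polynomial part $n^{-\alpha/(2\alpha+d)}$ dominates only for large samples.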

By Lemma 4.5 in van der Vaart and van Zanten (2009), for a fixed scaling parameter $a$ and any $\epsilon < 1/2$, we have the following upper bound on the covering entropy of the unit ball in the RKHS:

\[
\log N(\epsilon, \mathbb H_1^{a}, \|\cdot\|_{\infty}) \le K_1 a^d \Big(\log\frac{1}{\epsilon}\Big)^{1+d}, \tag{A.17}
\]

where $K_1$ is some universal constant. For the squared exponential kernel, all elements in $\mathbb H_1^{a}$ are infinitely differentiable. Consequently, by slightly modifying their proof, the sup-norm in the above result can be generalized to the $\|\cdot\|_{C^{\gamma}}$-norm: for any smoothness index $\gamma > 0$,

\[
\log N(\epsilon, \mathbb H_1^{a}, \|\cdot\|_{C^{\gamma}}) \lesssim a^d \Big(\log\frac{1}{\epsilon}\Big)^{1+d}.
\]

Then by the relationship between the small ball probability of a Gaussian process and the covering entropy of the unit ball in the associated RKHS (Li and Linde, 1999), we can obtain by following the proof of Lemma 4.6 in van der Vaart and van Zanten (2009) that for any $\gamma > 0$,

\[
-\log \Pi(\|W^{a}\|_{C^{\gamma}} \le \epsilon) \le K a^d \Big(\log\frac{a}{\epsilon}\Big)^{1+d}. \tag{A.18}
\]
Denote the right hand side of the above by $\phi_0^{a}(\epsilon)$. Note that the above also holds when the $\|\cdot\|_{C^{\gamma}}$ norm is replaced with the sup-norm by applying inequality (A.17) instead. Then by Borell's inequality (van der Vaart and van Zanten, 2008b),

\[
\Pi(W^{a} \notin M\mathbb H_1^{a} + \epsilon C^{\alpha}_1) \le 1 - \Phi\big(\Phi^{-1}(e^{-\phi_0^{a}(\epsilon)}) + M\big), \tag{A.19}
\]

where $\Phi$ is the c.d.f. of the standard normal distribution. Note that for $M \ge 4\sqrt{\phi_0^{a}(\epsilon)}$, the right hand side of the last display is bounded by $e^{-M^2/8}$.
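The closing claim, that $1 - \Phi(\Phi^{-1}(e^{-\phi}) + M) \le e^{-M^2/8}$ once $M \ge 4\sqrt{\phi}$, can be checked numerically. The sketch below treats $\phi$ as a free positive number standing in for $\phi_0^{a}(\epsilon)$; the function name and the sample values are assumptions for illustration. The upper tail is computed with `math.erfc` to avoid cancellation for large arguments.

```python
import math
from statistics import NormalDist

def borell_rhs(phi, m):
    """Right hand side of (A.19): 1 - Phi(Phi^{-1}(exp(-phi)) + m)."""
    z = NormalDist().inv_cdf(math.exp(-phi))   # Phi^{-1}(e^{-phi})
    return 0.5 * math.erfc((z + m) / math.sqrt(2.0))  # upper normal tail at z + m

for phi in (0.5, 2.0, 10.0, 50.0):
    m = 4.0 * math.sqrt(phi)            # smallest M allowed by the remark
    bound = math.exp(-m * m / 8.0)      # claimed bound e^{-M^2/8}
    print(f"phi={phi:5.1f} M={m:6.2f} rhs={borell_rhs(phi, m):.3e} bound={bound:.3e}")
```

The reason the bound holds is that $\Phi^{-1}(e^{-\phi}) \ge -\sqrt{2\phi}$, so the argument of the tail is at least $M/2$ when $M \ge 4\sqrt{\phi}$.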

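By applying the inequality (A.17) with $a = a_n$, we can obtain the following bound on the $\epsilon$-covering entropy of the sieve $\mathcal F_n$ for any $\epsilon > 0$,

\[
\log N(4\epsilon, \mathcal F_n, \|\cdot\|_{\infty}) \le K_2\, n\rho_n^2 (\log n)^{-(1+d)} \Big(\log\frac{1}{\epsilon}\Big)^{1+d} + K_2 \Big(\frac{\rho_n}{\epsilon}\Big)^{d/\alpha} + c\log\frac{\sqrt{n}}{\epsilon}, \tag{A.20}
\]

where we have used the fact that the covering entropy of $C^{\alpha}_1([0,1]^d)$ satisfies $\log N(\epsilon, C^{\alpha}_1, \|\cdot\|_{\infty}) \le K_2\epsilon^{-d/\alpha}$ and $K_2$ is some constant. By choosing $M_n = c_1\sqrt{n}\rho_n$ with $c_1$ sufficiently large so that $M_n \ge 4\sqrt{\phi_0^{a_n}(\rho_n)}$ and applying inequality (A.19), we have the following complement probability bound on $\mathcal F_n$ with some constant $c_2 > 0$,

\[
\Pi(\mathcal F_n^{c}) \le \exp(-c_2 n\rho_n^2). \tag{A.21}
\]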


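Therefore, the sieve $\mathcal F_n$ satisfies condition a and condition b in Lemma 3.4 with $\epsilon_n = \rho_n$. Next we verify condition c in Lemma 3.4. For the partially linear model, we have

\[
K(P^{(n)}_{\theta_0,\eta_0}, P^{(n)}_{\theta,\eta}) = E_0\big\{\log\big(dP^{(n)}_{\theta_0,\eta_0}/dP^{(n)}_{\theta,\eta}\big)\big\} = \frac{1}{2}\sum_{i=1}^{n} \big[(\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)\big]^2,
\]

and

\[
\begin{aligned}
V_{2,0}(P^{(n)}_{\theta_0,\eta_0}, P^{(n)}_{\theta,\eta}) &= E_0\big\{\big|\log\big(dP^{(n)}_{\theta_0,\eta_0}/dP^{(n)}_{\theta,\eta}\big) - K(P^{(n)}_{\theta_0,\eta_0}, P^{(n)}_{\theta,\eta})\big|^2\big\} \\
&= E_0\Big[\Big(\sum_{i=1}^{n} W_i\big[(\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)\big]\Big)^2 \,\Big|\, U^n, V^n\Big] \\
&= \sum_{i=1}^{n} \big[(\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)\big]^2,
\end{aligned}
\]

where the last step follows by the fact that given $(U_i, V_i)$, the random variable $\sum_{i=1}^{n} W_i[(\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)]$ follows a normal distribution with mean zero and variance $\sum_{i=1}^{n} [(\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)]^2$. Therefore, for any $\epsilon > 0$ we have

\[
\begin{aligned}
B_n(P^{(n)}_{\theta_0,\eta_0}, \epsilon) &= \big\{(\theta,\eta) : K(P^{(n)}_{\theta_0,\eta_0}, P^{(n)}_{\theta,\eta}) \le n\epsilon^2,\ V_{2,0}(P^{(n)}_{\theta_0,\eta_0}, P^{(n)}_{\theta,\eta}) \le n\epsilon^2\big\} \\
&= \big\{(\theta,\eta) : \|U^T(\theta-\theta_0) + \eta - \eta_0\|_n^2 \le \epsilon^2\big\}.
\end{aligned}
\]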

As a result, by applying inequality (A.18) with the sup-norm and $\epsilon = \rho_n/2$, we obtain that for the independent prior, there exists some constant $c_3$ such that

\[
\Pi\big(B_n(P^{(n)}_{\theta_0,\eta_0}, \rho_n)\big) \ge \Pi_{\mathcal H}(\|\eta-\eta_0\|_{\infty} \le \rho_n/2) \cdot \Pi_{\Theta}(|\theta-\theta_0| \le \rho_n/2) \ge \exp(-c_3 n\rho_n^2). \tag{A.22}
\]
Before applying Lemma 3.4, we remark that although the average Hellinger metric $d_n$ used in Lemma 3.4 is equivalent to the empirical metric $\|\cdot\|_n$ only if the class of regression functions is uniformly bounded, the argument in Section 7.7 of Ghosal and van der Vaart (2007) suggests that we may use $\|\cdot\|_n$ instead of $d_n$ throughout. Hence, the distance between $(\theta,\eta)$ and $(\theta',\eta')$ is given by

\[
\|U^T(\theta-\theta') + \eta - \eta'\|_n := \sqrt{n^{-1}\sum_{i=1}^{n} \big(U_i^T(\theta-\theta') + \eta(V_i) - \eta'(V_i)\big)^2}.
\]

Therefore, by combining (A.20), (A.21), (A.22) and Lemma 3.4, we can prove Assumption 1 and conclude that

\[
\Pi\big\{\|U^T(\theta-\theta_0) + \eta - \eta_0\|_n \le M\rho_n,\ \theta \in \mathcal F_n^{\theta},\ \eta \in \mathcal F_n^{\eta} \,\big|\, X_1,\ldots,X_n\big\} = 1 - O_{P_0}(\delta_n), \tag{A.23}
\]

where $M$ is a constant and $\delta_n = e^{-Cn\epsilon_n^2}$ for some $C > 0$.
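The empirical distance and the two divergences of the partially linear model are simple functions of the design points. The following sketch computes them for a toy design; the function names and the data are hypothetical and only illustrate the formulas $K = \tfrac12\sum_i[\cdot]^2$ and $V_{2,0} = \sum_i[\cdot]^2$ stated in this proof, under unit noise variance.

```python
import math

def empirical_distance(U, V, theta, theta_p, eta, eta_p):
    """||U^T(theta - theta') + eta - eta'||_n for design points (U_i, V_i)."""
    total = 0.0
    for u_i, v_i in zip(U, V):
        linear = sum(u * (a - b) for u, a, b in zip(u_i, theta, theta_p))
        total += (linear + eta(v_i) - eta_p(v_i)) ** 2
    return math.sqrt(total / len(U))

def kl_divergence(U, V, theta, theta_p, eta, eta_p):
    """K = (n/2) * squared empirical distance (Gaussian errors, unit variance)."""
    return 0.5 * len(U) * empirical_distance(U, V, theta, theta_p, eta, eta_p) ** 2

def v20_divergence(U, V, theta, theta_p, eta, eta_p):
    """V_{2,0} = n * squared empirical distance."""
    return len(U) * empirical_distance(U, V, theta, theta_p, eta, eta_p) ** 2

# Toy design with p = 1: three observations.
U = [(1.0,), (2.0,), (3.0,)]
V = [0.0, 0.5, 1.0]
dist = empirical_distance(U, V, (1.0,), (0.0,), lambda v: v, lambda v: 0.0)
kl = kl_divergence(U, V, (1.0,), (0.0,), lambda v: v, lambda v: 0.0)
print(dist, kl)
```

Since $V_{2,0} = 2K$ here, the constraint $V_{2,0} \le n\epsilon^2$ is the binding one, which is why the neighborhood $B_n$ reduces to a ball in the empirical norm.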

Next, we show that under Condition (ii) in Theorem 5.1, (A.23) implies $\Pi(|\theta-\theta_0| \le M\rho_n,\ \|\eta-\eta_0\|_n \le M\rho_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n)$. Denote $I_n^2 = n^{-1}\sum_{i=1}^{n} \big((\eta-\eta_0)(V_i) + (\theta-\theta_0)^T E[U_i|V_i]\big)^2$. We need to apply the following lemma, whose proof is provided in Subsection A.11.

Lemma A.3. Under the condition of the theorem, we have

\[
\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta}} \frac{\Big|\frac{1}{\sqrt{n}}\sum_{i=1}^{n} (U_i - E(U_i|V_i)) \cdot \big((\eta-\eta_0)(V_i) + (\theta-\theta_0)^T E(U_i|V_i)\big)\Big|}{\sqrt{n}\rho_n I_n \log n \,\vee\, \sqrt{n}\rho_n^2} = O_{P_0}(1).
\]

By Lemma A.3, we have that for any $\theta \in \mathcal F_n^{\theta}$ and $\eta \in \mathcal F_n^{\eta}$,

\[
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \big((\theta-\theta_0)^T U_i + (\eta-\eta_0)(V_i)\big)^2
&= \frac{1}{n}\sum_{i=1}^{n} \Big((\theta-\theta_0)^T (U_i - E[U_i|V_i]) + \big((\eta-\eta_0)(V_i) + (\theta-\theta_0)^T E[U_i|V_i]\big)\Big)^2 \\
&\overset{(i)}{=} (\theta-\theta_0)^T \big[P(U - E[U|V])(U - E[U|V])^T + O_{P_0}(n^{-1/2})\big] (\theta-\theta_0) \\
&\qquad + O_{P_0}(n^{-1/2}) \cdot |\theta-\theta_0| \cdot (I_n\log n \vee \rho_n) + I_n^2 \\
&\ge (K_3 + o_{P_0}(1)) \cdot \big(|\theta-\theta_0|^2 + (I_n \vee \rho_n)^2\big),
\end{aligned}
\]

for some constant $K_3$, where in step (i) we have applied the central limit theorem for the sum $n^{-1}\sum_{i=1}^{n} (U_i - E[U_i|V_i])(U_i - E[U_i|V_i])^T$ and in the last step we have used the assumption that the matrix $P(U - E[U|V])(U - E[U|V])^T$ is invertible. Combining the above with (A.23), we obtain

\[
\Pi(|\theta-\theta_0| \le M\rho_n \mid X_1,\ldots,X_n) = 1 - O_{P_0}(\delta_n).
\]

Again applying (A.23) and using the inequality $(a+b)^2 \ge b^2/2 - a^2$, we have that for $M$ sufficiently large,

\[
\Pi(\|\eta-\eta_0\|_n \le M\rho_n \mid X_1,\ldots,X_n) = 1 - O_{P_0}(\delta_n).
\]

Combining the two above yields

\[
\Pi(|\theta-\theta_0| \le M\rho_n,\ \|\eta-\eta_0\|_n \le M\rho_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n).
\]

Therefore, if we define the localization sequence $\mathcal H_n = \{\eta \in \mathcal F_n^{\eta} : \|\eta-\eta_0\|_n \le M\rho_n\}$, then the last display and Lemma 3.4 imply

\[
\Pi(|\theta-\theta_0| \le M\rho_n,\ \eta \in \mathcal H_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n). \tag{A.24}
\]
Note that with the enlargement procedure described after assumption (A5), the above still holds.

Verification of (A3): (A3) is true with $h^*(v) = -E[U|V=v]$.

Verification of (A1): We verify assumption (A1) with the above choice of $\mathcal H_n$. For the partially linear model, $\Delta\eta(\theta) = -(\theta-\theta_0)^T E[U|V]$. We use the notation $P_n = n^{-1}\sum_{i=1}^{n}\delta_{X_i}$ to denote the empirical measure and $\mathbb G_n = n^{-1/2}\sum_{i=1}^{n}(\delta_{X_i} - P)$ the empirical process with respect to an i.i.d. sequence $\{X_i\}$. For a partially linear model, we can express the log likelihood ratio by

\[
\begin{aligned}
\log\frac{dP^{(n)}_{\theta,\eta+\Delta\eta(\theta)}}{dP^{(n)}_{\theta_0,\eta}}
&= -\frac{1}{2}\sum_{i=1}^{n} \big[w_i - (\eta-\eta_0)(V_i) - (\theta-\theta_0)^T(U_i - E[U_i|V_i])\big]^2 + \frac{1}{2}\sum_{i=1}^{n} \big[w_i - (\eta-\eta_0)(V_i)\big]^2 \\
&= (\theta-\theta_0)^T\sum_{i=1}^{n}\tilde\ell_0(X_i) - \frac{n}{2}(\theta-\theta_0)^T I_0(\theta-\theta_0) - \frac{\sqrt{n}}{2}(\theta-\theta_0)^T\,\mathbb G_n\big[(U - E[U|V])(U - E[U|V])^T\big](\theta-\theta_0) \\
&\qquad - (\theta-\theta_0)^T\sum_{i=1}^{n} (U_i - E[U_i|V_i])(\eta-\eta_0)(V_i),
\end{aligned}
\]

where $\tilde\ell_0(X) = w(U - E[U|V])$ is the efficient score function and $I_0 = P(U - E[U|V])(U - E[U|V])^T = E_0\tilde\ell_0\tilde\ell_0^T$ the efficient information matrix.

We next analyze the four terms in the preceding display. By the central limit theorem, the third term is $O_{P_0}(\sqrt{n}|\theta-\theta_0|^2)$. An upper bound for the last term can be obtained by applying Lemma A.2 conditioning on the $V_i$'s, where the corresponding semimetric $d$ is bounded by the sup-norm. Inequality (A.28) provides an upper bound for the covering entropy of the space $\{\eta-\eta_0 : \eta \in \mathcal H_n\}$. Note that even working with the enlarged set $\mathcal H_n$ described after assumption (A5), the additional term in the upper bound is negligible. Since $\|\eta-\eta_0\|_n \le M\rho_n$ for any $\eta \in \mathcal H_n$, and the $U_i$ conditioning on $V_i$ are bounded and i.i.d. with $E\{U_i - E[U_i|V_i] \mid V_i\} = 0$, an application of Lemma A.2 and inequality (A.28) yields

\[
E_0\Big\{\sup_{\eta\in\mathcal H_n} \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^{n} (U_i - E[U_i|V_i])(\eta-\eta_0)(V_i)\Big| \,\Big|\, V_1,\ldots,V_n\Big\} \le K\int_0^{M\rho_n} \sqrt{\log N(\epsilon, \mathcal H_n, \|\cdot\|_{\infty})}\, d\epsilon \le K_4\sqrt{n}\rho_n^2, \tag{A.25}
\]

for some constant $K_4$. Hence

\[
\sup_{\eta\in\mathcal H_n} (\theta-\theta_0)^T\Big|\sum_{i=1}^{n} (U_i - E[U_i|V_i])(\eta-\eta_0)(V_i)\Big| = O_{P_0}\{n|\theta-\theta_0|\rho_n^2\}.
\]

Combining the above arguments, we verify (A1) with $G_n(t) = \sqrt{n}t^2 + n\rho_n^2 t$.


Verification of (A6): By Lemma 4.3 in van der Vaart and van Zanten (2009) and the assumption that each component of $E[U|V=\cdot]$ is at least $\alpha$-smooth, there exists a sequence of functions $\{h_n = (h_{1,n},\ldots,h_{p,n})^T : \mathbb R^d \to \mathbb R^p\}$, such that $\|h_n + E[U|V=\cdot]\|_{\infty} \lesssim \rho_n$ and $\|h_{s,n}\|_{a_n} \le C\sqrt{a_n^d} \le C\sqrt{n}\rho_n$ for all $s = 1,\ldots,p$. Do a change of variables $\eta \to \eta + (\theta-\theta_0)^T h_n$. Since for Gaussian processes the Radon-Nikodym derivative is $d\Pi_{W+g}/d\Pi_{W}(W) = \exp(Ug - \|g\|_{a_n}^2/2)$ (van der Vaart and van Zanten, 2008b, Lemma 3.1), where $U : \mathbb H^{a_n} \to \mathbb R$ is a random operator such that $\mathrm{Var}[U(g)] = \|g\|_{a_n}^2$ for any function $g$ in the RKHS $\mathbb H^{a_n}$ associated with the GP, we have

\[
\begin{aligned}
\log f_n(\eta) &= \log\, d\Pi_{\mathcal H + (\theta-\theta_0)^T h_n}/d\Pi_{\mathcal H}(W) \\
&= (\theta-\theta_0)^T U(h_n) - (\theta-\theta_0)^T H_n(\theta-\theta_0)/2 = O_{P_0}(G_n(|\theta-\theta_0|)),
\end{aligned}
\]

with $G_n(t) = \sqrt{n}\rho_n t + n\rho_n^2 t^2$, where $H_n$ is a $p \times p$ matrix with $H_{st} = \langle h_{s,n}, h_{t,n}\rangle_{a_n} \le Cn\rho_n^2$.
Verification of (A4): For the same $h_n$ as defined above, we have

\[
\Delta\eta(\theta_n) - (\theta_n-\theta_0)^T h_n = -(\theta_n-\theta_0)^T\big(E[U|V=\cdot] + h_n\big) = O(|\theta_n-\theta_0|\rho_n).
\]

Then for $\eta \in \mathcal H_n$ and $\theta_n$ satisfying $|\theta_n-\theta_0| = o_{P_0}(1)$,

\[
\begin{aligned}
&\ell_n\big(\theta_0, \eta + (\theta_n-\theta_0)^T(h_n - h^*)\big) - \ell_n(\theta_0,\eta) \\
&\quad= -\frac{1}{2}\sum_{i=1}^{n} \big(w_i + (\eta-\eta_0)(V_i) + (\theta_n-\theta_0)^T(h_n - h^*)(V_i)\big)^2 + \frac{1}{2}\sum_{i=1}^{n} \big(w_i + (\eta-\eta_0)(V_i)\big)^2 \\
&\quad= -(\theta_n-\theta_0)^T\sum_{i=1}^{n} \big(w_i + (\eta-\eta_0)(V_i)\big)(h_n - h^*)(V_i) + O\big(n|\theta_n-\theta_0|^2 \cdot \|h_n - h^*\|_n\big).
\end{aligned}
\]

By Cauchy's inequality, for $\eta \in \mathcal H_n$,

\[
\Big|\sum_{i=1}^{n} (\eta-\eta_0)(V_i)(h_n - h^*)(V_i)\Big| \le n\|\eta-\eta_0\|_n\|h_n - h^*\|_n = O(n\rho_n^2).
\]

Since $E\big|\sum_{i=1}^{n} w_i(h_n - h^*)(V_i)\big|^2 = n\|h_n - h^*\|_n^2 = O(n\rho_n^2)$, we obtain $\big|\sum_{i=1}^{n} w_i(h_n - h^*)(V_i)\big| = O_{P_0}(\sqrt{n}\rho_n)$. Combining the above three, we have

\[
\ell_n\big(\theta_0, \eta + (\theta_n-\theta_0)^T(h_n - h^*)\big) - \ell_n(\theta_0,\eta) = O_{P_0}(G_n(|\theta_n-\theta_0|)),
\]

with $G_n(t) = \sqrt{n}\rho_n t + n\rho_n^2 t + n\rho_n t^2$.

Finally, applying Theorem 4.3 yields the second order semiparametric BvM theorem for the independent prior with a remainder term

\[
G_n(n^{-1/2}\log n) + G_n(n^{-1/2}\log n) + \delta_n \sim \sqrt{n}\rho_n^2\log n.
\]
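The order of the remainder can be checked by plugging $t = n^{-1/2}\log n$ into the three bounds derived above for (A1), (A6) and (A4). The sketch below does this arithmetic; summing all three bounds is an illustrative assumption made here (the exact combination is fixed by Theorem 4.3), and the function names and the values of $n$, $\alpha$, $d$ are hypothetical.

```python
import math

def rho(n, alpha, d):
    """rho_n = n^{-alpha/(2 alpha + d)} * (log n)^{d+1}."""
    return n ** (-alpha / (2 * alpha + d)) * math.log(n) ** (d + 1)

def g_a1(t, n, r):
    """Bound from (A1): sqrt(n) t^2 + n rho_n^2 t."""
    return math.sqrt(n) * t ** 2 + n * r ** 2 * t

def g_a6(t, n, r):
    """Bound from (A6): sqrt(n) rho_n t + n rho_n^2 t^2."""
    return math.sqrt(n) * r * t + n * r ** 2 * t ** 2

def g_a4(t, n, r):
    """Bound from (A4): sqrt(n) rho_n t + n rho_n^2 t + n rho_n t^2."""
    return math.sqrt(n) * r * t + n * r ** 2 * t + n * r * t ** 2

def remainder_ratio(n, alpha, d):
    """Sum of the three bounds at t = log(n)/sqrt(n), divided by sqrt(n) rho_n^2 log n."""
    r = rho(n, alpha, d)
    t = math.log(n) / math.sqrt(n)
    target = math.sqrt(n) * r ** 2 * math.log(n)
    return (g_a1(t, n, r) + g_a6(t, n, r) + g_a4(t, n, r)) / target

for n in (10**6, 10**9, 10**12):
    print(n, round(remainder_ratio(n, 2.0, 1), 4))
```

The ratio stays bounded because every other term is of smaller order than $\sqrt{n}\rho_n^2\log n$ whenever $\sqrt{n}\rho_n \to \infty$.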


Dependent prior: Verification of Assumption 1. Without loss of generality, we assume that $\|h_n\|_{\infty} \le 1$. Similar to the independent prior case, we construct $\mathcal F_n$ as $\mathcal F_n^{\theta} \oplus \mathcal F_n^{\eta}$, with

\[
\mathcal F_n^{\theta} = [-c\sqrt{n},\, c\sqrt{n}]^p \quad\text{and}\quad \mathcal F_n^{\eta} = \rho_n C^{\alpha}_1 + M_n\mathbb H_1^{a_n} + \{(\theta-\theta_0)^T h_n : \theta \in \mathcal F_n^{\theta}\}. \tag{A.26}
\]

Compared to the sieve (A.16) for the independent prior, the third term $\{(\theta-\theta_0)^T h_n : \theta \in \mathcal F_n^{\theta}\}$ is added to reflect the dependence structure. For such a sieve $\mathcal F_n$, the covering entropy upper bound in inequality (A.20) is still true for some constant $c$.

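Moreover, we have the following complement probability bound on $\mathcal F_n$,

\[
\Pi(\mathcal F_n^{c}) \le \Pi_{\Theta}\big((\mathcal F_n^{\theta})^{c}\big) + \Pi\big((\mathcal F_n^{\eta})^{c}\big).
\]

The first term above can be bounded by $e^{-c_2 n\rho_n^2}$ for some constant $c_2$ and the second term satisfies

\[
\begin{aligned}
\Pi\big((\mathcal F_n^{\eta})^{c}\big) &\le \Pi_{\Theta}\big((\mathcal F_n^{\theta})^{c}\big) + \int_{\theta\in\mathcal F_n^{\theta}} \Pi^{\theta}_{\mathcal H}\big(\eta \notin \rho_n C^{\alpha}_1 + M_n\mathbb H_1^{a_n} + (\theta-\theta_0)^T h_n\big)\, d\Pi_{\Theta}(\theta) \\
&\overset{(i)}{\le} \exp\{-c_2 n\rho_n^2\} + \Pi^{\theta_0}_{\mathcal H}\big(\eta \notin \rho_n C^{\alpha}_1 + M_n\mathbb H_1^{a_n}\big) \overset{(ii)}{\le} 2\exp\{-c_2 n\rho_n^2\}.
\end{aligned}
\]

Step (i) follows since by the definition of the dependent prior we have $\Pi^{\theta}_{\mathcal H}(\eta \in A) = \Pi^{\theta_0}_{\mathcal H}(\eta + (\theta-\theta_0)^T h_n \in A)$ for any measurable subset $A$ of $\mathcal H$, while step (ii) follows by inequality (A.19). By combining the above arguments, we obtain $\Pi(\mathcal F_n^{c}) \le 3e^{-c_2 n\rho_n^2}$.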

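Finally, there exists a constant, denoted by $c_3$, such that

\[
\begin{aligned}
\Pi\big(B_n(P^{(n)}_{\theta_0,\eta_0}, \rho_n)\big) &\ge \Pi(\|\eta-\eta_0\|_{\infty} \le \rho_n/2,\ |\theta-\theta_0| \le \rho_n/4) \\
&= \int_{|\theta-\theta_0|\le\rho_n/4} \Pi^{\theta}_{\mathcal H}(\|\eta-\eta_0\|_{\infty} \le \rho_n/2)\, d\Pi_{\Theta}(\theta) \\
&\overset{(iii)}{\ge} \Pi^{\theta_0}_{\mathcal H}(\|\eta-\eta_0\|_{\infty} \le \rho_n/4) \cdot \Pi_{\Theta}(|\theta-\theta_0| \le \rho_n/4) \overset{(iv)}{\ge} \exp(-c_3 n\rho_n^2),
\end{aligned}
\]

where step (iii) follows since by the definition of the dependent prior we have

\[
\Pi^{\theta}_{\mathcal H}(\|\eta-\eta_0\|_{\infty} \le \rho_n/2) = \Pi^{\theta_0}_{\mathcal H}(\|\eta + (\theta-\theta_0)^T h_n - \eta_0\|_{\infty} \le \rho_n/2) \ge \Pi^{\theta_0}_{\mathcal H}(\|\eta-\eta_0\|_{\infty} \le \rho_n/4)
\]

for any $\theta$ satisfying $|\theta-\theta_0| \le \rho_n/4$, and step (iv) follows by applying inequality (A.18) with the sup-norm and $\epsilon = \rho_n/4$. Based on these results, the rest of the steps are the same as those for the independent prior.

The verifications of (A1), (A3) and (A4) are also the same as those for the independent prior. Since we do not need to verify (A6) for the dependent prior, an application of Theorem 4.3 yields the claimed result.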


A.10 Proof of Theorem A.1


Verification of Assumption 1. In this adaptive case, we apply Lemma 3.4 with a modified sieve for the nuisance parameter $\eta$ from van der Vaart and van Zanten (2009). This sieve construction is in the same spirit as the sieve constructed in Theorem 5.1 for the non-adaptive scenario.

More specifically, we choose the sieve $\mathcal F_n$ as $\mathcal F_n^{\theta} \oplus \mathcal F_n^{\eta}$, with

\[
\begin{aligned}
\mathcal F_n^{\theta} &= [-c\sqrt{n},\, c\sqrt{n}]^p, \\
\mathcal F_n^{\eta} &= \Big(M_n\sqrt{\frac{r_n}{\delta_n}}\,\mathbb H_1^{r_n} + \rho_n C^{\alpha}_1\Big) \cup \Big(\bigcup_{a<\delta_n} (M_n\mathbb H_1^{a}) + \rho_n C^{\alpha}_1\Big),
\end{aligned}
\tag{A.27}
\]

with $c$ a sufficiently large constant, $\rho_n = n^{-\alpha/(2\alpha+d)}(\log n)^{d+1}$, and $(M_n, r_n, \delta_n)$ satisfying

\[
D_2 r_n^{d} \ge 2C_0 n\rho_n^2, \qquad r_n^{p-d+1} \le e^{C_0 n\rho_n^2}, \qquad M_n^2 \ge 8C_0 n\rho_n^2, \qquad \delta_n = C_1\rho_n/(2\sqrt{d}M_n).
\]

Borrowing the results in the proof of Theorem 3.1 in van der Vaart and van Zanten (2009) and the intermediate results in the proof of Theorem 5.1 about the covering entropy and complementary probability for $M\mathbb H_1^{a} + \epsilon C^{\alpha}_1$ for a fixed bandwidth parameter $a$, we can verify that $\mathcal F_n$ satisfies condition a and condition b in Lemma 3.4:

\[
\log N(4\epsilon, \mathcal F_n, \|\cdot\|_{\infty}) \le K_2\, n\rho_n^2(\log n)^{-(d+1)}\Big(\log\frac{1}{\epsilon}\Big)^{1+d} + K_2\Big(\frac{\rho_n}{\epsilon}\Big)^{d/\alpha} + c\log\frac{\sqrt{n}}{\epsilon}, \tag{A.28}
\]

for some constant $K_2$, and for some constant $c_2$,

\[
\Pi(\mathcal F_n^{c}) \le \exp(-c_2 n\rho_n^2). \tag{A.29}
\]
Under the conditions stated in the theorem, the rest of the proof of $\Pi(|\theta-\theta_0| \le M\rho_n,\ \|\eta-\eta_0\|_n \le M\rho_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n)$ is similar to that in Theorem 5.1 and is skipped here.

The verifications of (A1) and (A3) are the same as those in the proof of Theorem 5.1.

Verification of (A6): We choose $h_n = h^* = -E[U|V=\cdot]$, with which (A4) is trivially satisfied. By Lemma 1 in Ghosal and van der Vaart (2007), we have $\Pi(A_n \mid X^{(n)}) = 1 - O_{P_0}(\delta_n)$ with $A_n = \{A \le Cn\rho_n^2\}$ for $C$ sufficiently large, where $A$ is the random inverse bandwidth parameter in the GP prior. Consequently, we can always assume $A \le Cn\rho_n^2$ by conditioning on the event $A_n$. By the assumption on the least favorable direction $h^* = -E[U|V=\cdot]$, each component of $E[U|V]$ satisfies $E[U_s|V] \in \mathbb H^{a_0}$ for $s = 1,\ldots,p$. Then, by Lemma 4.7 in van der Vaart and van Zanten (2009), we have $\|E[U_s|V=\cdot]\|_{a} \le C_1\sqrt{a}$, where $C_1 = \sup_s \|E[U_s|V=\cdot]\|_{a_0}/\sqrt{a_0}$ is a constant independent of $a$. Here we recall that the $\|\cdot\|_{a}$-norm is the norm of the RKHS associated with the kernel $K_a$. Denote the conditional prior of $\eta$ given $(A = a)$ by $\Pi_a$. Do a change of variables $\eta \to \eta - (\theta-\theta_0)^T E[U|V]$. Similar to the proof in Theorem 5.1, the Radon-Nikodym derivative $d\Pi_{a,\,W+(\theta-\theta_0)^T h^*}/d\Pi_a(W)$ takes the form $\exp\big((\theta-\theta_0)^T Uh^* - (\theta-\theta_0)^T H(\theta-\theta_0)/2\big)$, where $H$ is a $p \times p$ matrix with $H_{st} = \langle h^*_s, h^*_t\rangle_{a} \le C_1^2 a$ and $\mathrm{Var}\{U E[U_j|V]\} = \|E[U_j|V=\cdot]\|_{a}^2$ for $j = 1,\ldots,p$. Therefore, we obtain

\[
\begin{aligned}
\log f_n(\eta) &= \log\, d\Pi_{a,\,\mathcal H+(\theta-\theta_0)^T h^*}/d\Pi_a(W) \\
&= (\theta-\theta_0)^T U(E[U|V]) - (\theta-\theta_0)^T H(\theta-\theta_0)/2 = O_{P_0}(G_n(|\theta-\theta_0|)),
\end{aligned}
\]

with $G_n(t) = \sqrt{n}\rho_n t + n\rho_n^2 t^2$.

Finally, applying Theorem 4.3 yields the second order semiparametric BvM theorem with a remainder term

\[
G_n(n^{-1/2}\log n) + G_n(n^{-1/2}\log n) + \delta_n \sim n^{1/2}\rho_n^2\log n.
\]
A.11 Proof of Lemma A.3

The proof is based on Lemma A.2. Since the $U_i$'s are bounded, conditioning on the $V_i$'s, we have that

\[
W_{\theta,\eta} := \frac{1}{\sqrt{n}}\sum_{i=1}^{n} (U_i - E(U_i|V_i)) \cdot \big((\eta-\eta_0)(V_i) + (\theta-\theta_0)^T E(U_i|V_i)\big)
\]

is a sub-Gaussian process indexed by $(\theta,\eta) \in \mathcal F_n = \mathcal F_n^{\theta} \times \mathcal F_n^{\eta}$ with the semimetric $d$ (defined in Lemma A.2) given by $d((\theta,\eta),(\theta',\eta')) = \big[n^{-1}\sum_{i=1}^{n} \big((\eta-\eta')(V_i) + (\theta-\theta')^T E[U_i|V_i]\big)^2\big]^{1/2}$, which is dominated by $2\|\eta-\eta'\|_{\infty} + 2|\theta-\theta'|$. Then by applying inequality (A.20) and noticing the assumption that $\alpha > d/2$, we have that for any $\delta \ge \rho_n$,

\[
\int_0^{\delta} \sqrt{1 + \log N(\epsilon, \mathcal F_n, d)}\, d\epsilon \le C\sqrt{n}\rho_n\delta, \tag{A.30}
\]

for some constant $C$. By applying Lemma A.2 and inequality (A.30) with $\delta = \rho_n$, we further obtain

\[
\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta},\, I_n\le\rho_n} |W_{\theta,\eta}| = O_{P_0}(\sqrt{n}\rho_n^2).
\]

To prove the claimed bound, we only need to show that

\[
\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta},\, I_n>\rho_n} \frac{|W_{\theta,\eta}|}{\sqrt{n}\rho_n I_n\log n} = O_{P_0}(1).
\]

By the reproducing property of the RKHS $\mathbb H^{a_n}$ and our definition of the sieve $\mathcal F_n$, $I_n$ is bounded by $L\sqrt{n}$ for some constant $L$. We will apply the peeling technique by dividing the range $(\rho_n, L\sqrt{n})$ of $I_n$ into $\bigcup_{s=1}^{S} [\rho_n 2^{s-1}, \rho_n 2^{s})$, where $S \le c\log n$ for some constant $c$. For each interval $[\rho_n 2^{s-1}, \rho_n 2^{s})$, we first apply Lemma A.2 and inequality (A.30) with $\delta = \rho_n 2^{s}$ and then add them up to obtain

\[
\begin{aligned}
P_0\Big(\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta},\, I_n>\rho_n} \frac{|W_{\theta,\eta}|}{\sqrt{n}\rho_n I_n} > \log n\Big)
&\le \sum_{s=1}^{S} P_0\Big(\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta},\, \rho_n 2^{s-1}\le I_n<\rho_n 2^{s}} \frac{|W_{\theta,\eta}|}{\sqrt{n}\rho_n I_n} > \log n\Big) \\
&\le \sum_{s=1}^{S} P_0\Big(\sup_{\theta\in\mathcal F_n^{\theta},\, \eta\in\mathcal F_n^{\eta},\, \rho_n 2^{s-1}\le I_n<\rho_n 2^{s}} |W_{\theta,\eta}| > \sqrt{n}\rho_n^2 2^{s-1}\log n\Big) \\
&\le \sum_{s=1}^{S} 2\exp\{-c_0(\log n)^2\} \le 2c\log n \cdot \exp\{-c_0(\log n)^2\} \to 0, \quad\text{as } n \to \infty.
\end{aligned}
\]

This completes the proof of the lemma.
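The peeling step can be made concrete by listing the dyadic shells and evaluating the resulting union bound. The sketch below is only an illustration: the values of `rho_n`, the upper limit `L * sqrt(n)` with `L = 1`, and the constant `c0 = 1` are hypothetical and chosen for readability, not taken from the proof.

```python
import math

def peeling_shells(rho_n, upper):
    """Dyadic shells [rho_n * 2^(s-1), rho_n * 2^s), s = 1..S, covering [rho_n, upper)."""
    shells = []
    s = 1
    while rho_n * 2 ** (s - 1) < upper:
        shells.append((rho_n * 2 ** (s - 1), rho_n * 2 ** s))
        s += 1
    return shells

def union_bound(num_shells, c0, n):
    """Sum over shells of the per-shell tail bound 2 * exp(-c0 * (log n)^2)."""
    return num_shells * 2.0 * math.exp(-c0 * math.log(n) ** 2)

n = 10**4
rho_n = 0.05                      # hypothetical contraction rate
upper = 1.0 * math.sqrt(n)        # hypothetical bound L * sqrt(n) with L = 1
shells = peeling_shells(rho_n, upper)
print(len(shells), shells[0], shells[-1])
print(union_bound(len(shells), 1.0, n))
```

The number of shells grows like $\log_2(L\sqrt{n}/\rho_n)$, which is of order $\log n$, while each shell contributes a tail of order $\exp\{-c_0(\log n)^2\}$, so the sum vanishes.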


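A.12 Proof of Theorem 5.2

This theorem is proved by combining the verification of Assumption 1 in Theorem A.1 and the proof of the dependent prior part in Theorem 5.1 with the sieve $\mathcal F_n = \mathcal F_n^{\theta} \oplus \mathcal F_n^{\eta}$ and localization sequence $\mathcal H_n = \{\eta \in \mathcal F_n^{\eta} : \|\eta-\eta_0\|_n \le M\rho_n\}$, where

\[
\begin{aligned}
\mathcal F_n^{\theta} &= [-c\sqrt{n},\, c\sqrt{n}]^p, \\
\mathcal F_n^{\eta} &= \Big(M_n\sqrt{\frac{r_n}{\delta_n}}\,\mathbb H_1^{r_n} + \rho_n C^{\alpha}_1\Big) \cup \Big(\bigcup_{a<\delta_n} (M_n\mathbb H_1^{a}) + \rho_n C^{\alpha}_1\Big) + \{(\theta-\theta_0)^T h_n : \theta \in \mathcal F_n^{\theta}\}.
\end{aligned}
\tag{A.31}
\]

Therefore we omit the proof of this theorem here.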


A.13 Proof of Theorem 5.3

The verification of Assumption 1 for the GPLM is similar to that of Theorem A.1 with the sieve $\mathcal F_n$ given by (A.31) and localization sequence $\mathcal H_n = \{\eta \in \mathcal F_n^{\eta} : \|\eta-\eta_0\|_n \le M\rho_n\}$, and we also have the three inequalities (A.20), (A.21) and (A.22) for this $\mathcal F_n$. It remains to check whether we can replace the Hellinger metric $d_n$ in Lemma 3.4 by the empirical metric $\|\cdot\|_n$ for the GPLM, which is indicated by the following lemma. Recall that for semiparametric models, we choose the parameter $\lambda$ in Lemma 3.4 to be the pair $(\theta,\eta)$, and the empirical distance between the regression functions under parameters $\lambda = (\theta,\eta)$ and $\lambda' = (\theta',\eta')$ is given by

\[
\|U^T(\theta-\theta') + \eta - \eta'\|_n = \sqrt{n^{-1}\sum_{i=1}^{n} \big(U_i^T(\theta-\theta') + \eta(V_i) - \eta'(V_i)\big)^2}.
\]

The proof of Lemma A.4 is provided in the next subsection.
Lemma A.4. For the GPLM, under Assumption 3 and conditions a, b and c of Lemma 3.4, there exist some constant $C_1 > 0$ and large enough $M$ such that

\[
\Pi\big(\|U^T(\theta-\theta_0) + \eta - \eta_0\|_n \ge M\epsilon_n \mid X_1,\ldots,X_n\big) = O_{P^{(n)}_{\lambda_0}}\big(e^{-C_1 n\epsilon_n^2}\big), \tag{A.32}
\]

\[
\Pi\big((\theta,\eta) \notin \mathcal F_n \mid X_1,\ldots,X_n\big) = O_{P^{(n)}_{\lambda_0}}\big(e^{-C_1 n\epsilon_n^2}\big). \tag{A.33}
\]

Based on Lemma A.4 and inequalities (A.20), (A.21) and (A.22), the rest of the steps are the same as those in the proof of Theorem 5.1.

Next, we prove (A1). By Assumption 1, the posterior of $\eta$ concentrates its mass in a small neighborhood $\mathcal H_n = \{\eta \in \mathcal F_n^{\eta} : \|\eta-\eta_0\|_n \le M\rho_n\}$. Write $q_n(\theta,\eta) = \sum_{i=1}^{n} q_{\theta,\eta}(Y_i)$ and recall that $\Delta\eta(\theta) = \eta^*(\theta) - \eta_0 = (\theta-\theta_0)^T h^*(V) + O(|\theta-\theta_0|^2)$. The proof of Lemma A.5 is provided in Section A.15.

Lemma A.5. Under Assumption 3, we have

\[
\begin{aligned}
q_n(\theta_n, \eta + \Delta\eta(\theta_n)) - q_n(\theta_0,\eta) &= (\theta_n-\theta_0)^T\sum_{i=1}^{n} W_i l_0(T_i)(U_i + h^*(V_i)) - \frac{1}{2}n(\theta_n-\theta_0)^T I_0(\theta_n-\theta_0) \\
&\qquad + O_{P_0}\big[R_n(\max\{|\theta-\theta_0|, n^{-1/2}\log n\})\big],
\end{aligned}
\tag{A.34}
\]

for every sequence $\{\theta_n\}$ satisfying $\theta_n = \theta_0 + O_{P_0}(\rho_n)$ and uniformly for every $\eta \in \mathcal H_n$, with $I_0 = E_0[l_0(T)f_0(T)(U + h^*(V))(U + h^*(V))^T]$ and $R_n(t) = nt^3 + \sqrt{n}t^2 + n\rho_n t^2 + n\rho_n^2 t + \sqrt{n}\rho\ldots$

To apply Theorem 4.3, it remains to verify (A4). By Lemma A.1, we have $\eta - \Delta\eta(\theta_n) + (\theta_n-\theta)^T h_n = O(|\theta-\theta_0|^2 + |\theta-\theta_0|\rho_n)$. Then (A4) is an easy consequence of (A.37), (A.38) and (A.39) with $\xi_0 = \eta_0$, $\xi_1 = \eta$ and $\xi_2 = \eta - \Delta\eta(\theta_n) + (\theta_n-\theta)^T h_n$ in the proof of Lemma A.5.

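A.14 Proof of Lemma A.4

According to Section 7.7 in Ghosal and van der Vaart (2007), we only need to verify that there exists a test function $\phi_n : \mathbb R^n \to \mathbb R$ for testing $\lambda_0 = (\theta_0,\eta_0)$ versus $\lambda_1 = (\theta_1,\eta_1)$ relative to the empirical norm $\|\lambda_0 - \lambda_1\|_n := \|U^T(\theta_0-\theta_1) + \eta_0 - \eta_1\|_n$ (instead of $d_n$) that satisfies the conclusion of Lemma 2 in Ghosal and van der Vaart (2007), i.e. $\phi_n$ satisfies

\[
P^{(n)}_{\lambda_0}\phi_n(Y^n) \le e^{-cn\|\lambda_0-\lambda_1\|_n^2}, \quad\text{and}\quad P^{(n)}_{\lambda}\big(1 - \phi_n(Y^n)\big) \le e^{-cn\|\lambda_0-\lambda_1\|_n^2}, \tag{A.35}
\]

for all $\lambda$ such that $\|\lambda-\lambda_1\|_n \le c'\|\lambda_0-\lambda_1\|_n$, where $(c, c')$ are constants independent of $n$ and we use the shorthand $Y^n$ to denote the response vector $(Y_1,\ldots,Y_n)$.

More specifically, we choose

\[
\phi_n(Y^n) = I\big(\|Y - m_{\theta_0,\eta_0}(T)\|_n^2 - \|Y - m_{\theta_1,\eta_1}(T)\|_n^2 > 0\big), \tag{A.36}
\]

where $\|Y - m_{\theta,\eta}(T)\|_n^2 := n^{-1}\sum_{i=1}^{n} (Y_i - m_{\theta,\eta}(T_i))^2$. Recall that by Assumption 3, under $P^{(n)}_{\lambda_0}$ the residuals $W_i = Y_i - m_{\theta_0,\eta_0}(T_i)$ are i.i.d. sub-Gaussian. Therefore, we have that for any $t > 0$,

\[
\begin{aligned}
P^{(n)}_{\lambda_0}\phi_n(Y^n) &= P^{(n)}_{\lambda_0}\Big\{\frac{1}{n}\sum_{i=1}^{n} W_i^2 - \frac{1}{n}\sum_{i=1}^{n} \big(W_i + m_{\theta_0,\eta_0}(T_i) - m_{\theta_1,\eta_1}(T_i)\big)^2 > 0\Big\} \\
&= P^{(n)}_{\lambda_0}\Big\{\sum_{i=1}^{n} tW_i\big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big) > \frac{t}{2}\sum_{i=1}^{n} \big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2\Big\} \\
&\overset{(i)}{\le} \exp\Big\{-\frac{t}{2}\sum_{i=1}^{n} \big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2\Big\} \prod_{i=1}^{n} E\exp\big\{tW_i\big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)\big\},
\end{aligned}
\]

where in step (i) we have applied Markov's inequality and used the independence among the $Y_i$'s. Then by Assumption 3(a), there exists some constant $C$ such that

\[
\begin{aligned}
P^{(n)}_{\lambda_0}\phi_n(Y^n) &\le \exp\Big\{-\frac{t}{2}\sum_{i=1}^{n} \big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2\Big\} \prod_{i=1}^{n} e^{Ct^2(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i))^2} \\
&= \exp\Big\{-\Big(\frac{t}{2} - Ct^2\Big)\sum_{i=1}^{n} \big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2\Big\}.
\end{aligned}
\]

By choosing $t = (4C)^{-1}$ in the above, we obtain

\[
P^{(n)}_{\lambda_0}\phi_n(Y^n) \le \exp\Big\{-\frac{1}{16C}\sum_{i=1}^{n} \big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2\Big\}.
\]

According to Assumption 3(b), we know that $\big(m_{\theta_1,\eta_1}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2 \ge 2(C_1C_2)^{-1}\big(U_i^T(\theta_1-\theta_0) + (\eta_1-\eta_0)(V_i)\big)^2$ for $i = 1,\ldots,n$, implying

\[
P^{(n)}_{\lambda_0}\phi_n(Y^n) \le \exp\{-cn\|\lambda_0-\lambda_1\|_n^2\},
\]

where the constant $c = (8CC_1C_2)^{-1}$. This proves the first part of (A.35).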

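Now we prove the second part of (A.35). By Assumption 3, under $P^{(n)}_{\lambda}$ the residuals $W_i = Y_i - m_{\theta,\eta}(T_i)$ are i.i.d. sub-Gaussian. Consequently, for any $\lambda$ and any $t > 0$ we have

\[
\begin{aligned}
P^{(n)}_{\lambda}\big(1 - \phi_n(Y^n)\big) &= P^{(n)}_{\lambda}\Big\{\frac{1}{n}\sum_{i=1}^{n} \big(W_i + m_{\theta,\eta}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2 - \frac{1}{n}\sum_{i=1}^{n} \big(W_i + m_{\theta,\eta}(T_i) - m_{\theta_1,\eta_1}(T_i)\big)^2 \le 0\Big\} \\
&= P^{(n)}_{\lambda}\Big\{\sum_{i=1}^{n} tW_i\big(m_{\theta_0,\eta_0}(T_i) - m_{\theta_1,\eta_1}(T_i)\big) \ge \frac{t}{2}\sum_{i=1}^{n} \Big(\big(m_{\theta,\eta}(T_i) - m_{\theta_0,\eta_0}(T_i)\big)^2 - \big(m_{\theta,\eta}(T_i) - m_{\theta_1,\eta_1}(T_i)\big)^2\Big)\Big\}.
\end{aligned}
\]

Then, similar to the first part, by using Assumption 3(b) and applying Markov's inequality we can obtain

\[
P^{(n)}_{\lambda}\big(1 - \phi_n(Y^n)\big) \le \exp\big\{-D_1 tn\|\lambda-\lambda_0\|_n^2 + D_2 tn\|\lambda-\lambda_1\|_n^2 + D_3 t^2 n\|\lambda_0-\lambda_1\|_n^2\big\},
\]

for some constants $D_1$, $D_2$ and $D_3$. Therefore, for any $\lambda$ such that $\|\lambda-\lambda_1\|_n \le c'\|\lambda_0-\lambda_1\|_n$, where $c' = \sqrt{D_1}/(\sqrt{D_1} + \sqrt{2D_2})$, we have $\|\lambda-\lambda_0\|_n \ge (1-c')\|\lambda_0-\lambda_1\|_n$ and

\[
P^{(n)}_{\lambda}\big(1 - \phi_n(Y^n)\big) \le \exp\big\{-D_4 tn\|\lambda_0-\lambda_1\|_n^2 + D_3 t^2 n\|\lambda_0-\lambda_1\|_n^2\big\},
\]

where $D_4 = D_1D_2/(\sqrt{D_1} + \sqrt{2D_2})^2$. Finally, by choosing $t = D_4/(2D_3)$ in the above, we obtain

\[
P^{(n)}_{\lambda}\big(1 - \phi_n(Y^n)\big) \le \exp\{-cn\|\lambda_0-\lambda_1\|_n^2\},
\]

where $c = D_4^2/(4D_3)$. This proves the second part of (A.35).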
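The test (A.36) simply prefers whichever of the two candidate mean vectors gives the smaller residual sum of squares. The following sketch implements it and estimates the type I error by simulation; the Gaussian noise, the constant mean vectors, the sample size and the seed are all hypothetical choices made for illustration and stand in for the GPLM means $m_{\theta_0,\eta_0}(T_i)$ and $m_{\theta_1,\eta_1}(T_i)$.

```python
import random

def phi_n(y, m0, m1):
    """Test (A.36): return 1 when the lambda_1 means fit y strictly better than the lambda_0 means."""
    rss0 = sum((yi - a) ** 2 for yi, a in zip(y, m0))
    rss1 = sum((yi - b) ** 2 for yi, b in zip(y, m1))
    return 1 if rss0 - rss1 > 0 else 0

def type_one_error(m0, m1, reps, rng):
    """Monte Carlo rejection frequency when data come from lambda_0 (standard Gaussian noise)."""
    rejections = 0
    for _ in range(reps):
        y = [a + rng.gauss(0.0, 1.0) for a in m0]
        rejections += phi_n(y, m0, m1)
    return rejections / reps

rng = random.Random(0)
n = 200
m0 = [0.0] * n   # hypothetical means under lambda_0
m1 = [0.5] * n   # hypothetical means under lambda_1
rate = type_one_error(m0, m1, reps=500, rng=rng)
print("estimated type I error:", rate)
```

With these values the rejection event requires the noise sum to exceed 50 while its standard deviation is about 14, so rejections under $\lambda_0$ are rare, in line with the exponential bound in the first part of (A.35).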
A.15 Proof of Lemma A.5

By the definitions of $q_n$ and $q_{\theta,\eta}$, we get

\[
q_n(\theta, \eta + \Delta\eta(\theta)) - q_n(\theta_0,\eta) = \sum_{i=1}^{n} W_i\int_{m_{\theta_0,\eta}(T_i)}^{m_{\theta,\eta+\Delta\eta(\theta)}(T_i)} \frac{1}{V(s)}\, ds - \sum_{i=1}^{n}\int_{m_{\theta_0,\eta}(T_i)}^{m_{\theta,\eta+\Delta\eta(\theta)}(T_i)} \frac{s - m_0(T_i)}{V(s)}\, ds = I - II,
\]

with $W_i = Y_i - m_0(T_i)$ satisfying $E_0W_i = 0$ and Assumption 3(a).

By applying Taylor expansion and Assumption 3(b), we have for any $\xi_0, \xi_1, \xi_2 \in \mathbb R$,

\[
\int_{F(\xi_1)}^{F(\xi_2)} \frac{1}{V(s)}\, ds = l(\xi_1)(\xi_2-\xi_1) + e_1(\xi_0)(\xi_2-\xi_1)^2 + O\big((\xi_2-\xi_1)^3\big) \tag{A.37}
\]

\[
\begin{aligned}
&= l(\xi_0)(\xi_2-\xi_1) + e_1(\xi_0)(\xi_2-\xi_1)^2 + e_2(\xi_0)(\xi_2-\xi_1)(\xi_1-\xi_0) \\
&\qquad + O\big\{(\xi_2-\xi_1)^3 + (\xi_2-\xi_1)^2(\xi_1-\xi_0) + (\xi_2-\xi_1)(\xi_1-\xi_0)^2\big\},
\end{aligned}
\tag{A.38}
\]

\[
\begin{aligned}
\int_{F(\xi_1)}^{F(\xi_2)} \frac{s - F(\xi_0)}{V(s)}\, ds &= \big(F(\xi_1) - F(\xi_0)\big)l(\xi_1)(\xi_2-\xi_1) + \frac{1}{2}l(\xi_1)f(\xi_1)(\xi_2-\xi_1)^2 \\
&\qquad + O\big\{(\xi_2-\xi_1)^3 + (\xi_2-\xi_1)^2(\xi_1-\xi_0)\big\} \\
&= l(\xi_0)f(\xi_0)(\xi_2-\xi_1)(\xi_1-\xi_0) + \frac{1}{2}l(\xi_0)f(\xi_0)(\xi_2-\xi_1)^2 \\
&\qquad + O\big\{(\xi_2-\xi_1)^3 + (\xi_2-\xi_1)^2(\xi_1-\xi_0) + (\xi_2-\xi_1)(\xi_1-\xi_0)^2\big\},
\end{aligned}
\tag{A.39}
\]

with $e_1(\xi)$ and $e_2(\xi)$ fixed bounded functions.

By the definitions of $g_{\theta,\eta}$ and $\Delta\eta(\theta)$, we have

\[
\begin{aligned}
g_{\theta_0,\eta}(T) - g_0(T) &= (\eta-\eta_0)(V), \\
g_{\theta,\eta+\Delta\eta(\theta)}(T) - g_{\theta_0,\eta}(T) &= (\theta-\theta_0)^T h_1(T) + O(|\theta-\theta_0|^2),
\end{aligned}
\]

with $h_1(T) = U + h^*(V)$. Combining the above and the definitions of $l_0$, $f_0$ and $m_{\theta,\eta}$, and (A.38) with $\xi_0 = g_0$, $\xi_1 = g_{\theta_0,\eta}$, and $\xi_2 = g_{\theta,\eta+\Delta\eta(\theta)}$, we get

\[
I = (\theta-\theta_0)^T\sum_{i=1}^{n} W_i l_0(T_i)h_1(T_i) + (\theta-\theta_0)^T\sum_{i=1}^{n} W_i e_2(g_0(T_i))h_1(T_i)(\eta-\eta_0)(V_i) + O_{P_0}(\sqrt{n}|\theta-\theta_0|^2),
\]

where the last term is obtained by combining the central limit theorem and the fact that $E_0W_i = 0$ and $E_0W_i^2 < \infty$.

Recall that $e_2$ is some bounded function in the expansion (A.38). Since $W_i$ is sub-Gaussian and $e_2$ and $h_1$ are bounded functions, we have that $W_i e_2(g_0(T_i))h_1(T_i)$ is also sub-Gaussian. Moreover, since $E_0W_i e_2(g_0(T_i))h_1(T_i) = E_0[e_2(g_0(T_i))h_1(T_i)E_0(W_i|T_i)] = 0$, similar to (A.25), by applying Lemma A.2 we get

\[
E_0\Big\{\sup_{\eta\in\mathcal H_n} \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^{n} W_i e_2(g_0(T_i))h_1(T_i)(\eta-\eta_0)(V_i)\Big| \,\Big|\, V_1,\ldots,V_n\Big\} \lesssim \int_0^{\rho_n} \sqrt{1 + \log N(\epsilon, \mathcal H_n, \|\cdot\|_{\infty})}\, d\epsilon \lesssim \sqrt{n}\rho_n^2 + \rho_n \sim \sqrt{n}\rho_n^2.
\]

Combining the above two, we get

\[
I = (\theta-\theta_0)^T\sum_{i=1}^{n} W_i l_0(T_i)h_1(T_i) + O_{P_0}\big(\sqrt{n}|\theta-\theta_0|^2 + n|\theta-\theta_0|\rho_n^2\big). \tag{A.40}
\]

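Similarly, using (A.39) and the same choices for $\xi_0$, $\xi_1$ and $\xi_2$, we get

\[
\begin{aligned}
II &= (\theta-\theta_0)^T\sum_{i=1}^{n} l_0(T_i)f_0(T_i)(U_i + h^*(V_i))(\eta-\eta_0)(V_i) + \frac{1}{2}\sum_{i=1}^{n} l_0(T_i)f_0(T_i)\big((\theta-\theta_0)^T h_1(T_i)\big)^2 \\
&\qquad + O_{P_0}\big\{n|\theta-\theta_0|^3 + n|\theta-\theta_0|^2\rho_n + n|\theta-\theta_0|\rho_n^2\big\},
\end{aligned}
\]

where $h_1(t) = u + h^*(v)$. By definition of $h^*$, we have

\[
E_0\big[l_0(T_i)f_0(T_i)(U_i + h^*(V_i))(\eta-\eta_0)(V_i)\big] = E_0\big[(\eta-\eta_0)(V_i)\,E_0\big(l_0(T_i)f_0(T_i)(U_i + h^*(V_i)) \mid V_i\big)\big] = 0.
\]

Therefore, by applying Lemma A.2 we get

\[
E_0\Big\{\sup_{\eta\in\mathcal H_n} \frac{1}{\sqrt{n}}\Big|\sum_{i=1}^{n} l_0(T_i)f_0(T_i)(U_i + h^*(V_i))(\eta-\eta_0)(V_i)\Big| \,\Big|\, V_1,\ldots,V_n\Big\} \lesssim \int_0^{\rho_n} \sqrt{1 + \log N(\epsilon, \mathcal H_n, \|\cdot\|_{\infty})}\, d\epsilon \lesssim \sqrt{n}\rho_n^2 + \rho_n.
\]

By the central limit theorem, we have

\[
\frac{1}{2}\sum_{i=1}^{n} l_0(T_i)f_0(T_i)\big((\theta-\theta_0)^T h_1(T_i)\big)^2 = \frac{n}{2}(\theta-\theta_0)^T E_0\big[l_0(T)f_0(T)h_1(T)(h_1(T))^T\big](\theta-\theta_0) + O_{P_0}(\sqrt{n}|\theta-\theta_0|^2).
\]

Combining the above, we have

\[
\begin{aligned}
II &= \frac{n}{2}(\theta-\theta_0)^T E_0\big[l_0(T)f_0(T)h_1(T)(h_1(T))^T\big](\theta-\theta_0) \\
&\qquad + O_{P_0}\big\{n|\theta-\theta_0|^3 + \sqrt{n}|\theta-\theta_0|^2 + n|\theta-\theta_0|^2\rho_n + n|\theta-\theta_0|\rho_n^2 + \sqrt{n}\rho\ldots\big\}.
\end{aligned}
\tag{A.41}
\]

By (A.40) and (A.41), the lemma is proved.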
A.16 Proof of Theorem 5.4

Verification of Assumption 1. We apply Lemma 3.4. We first state the following lemma for the Cox model with current status data, whose proof is similar to those of Lemma 7 and Lemma 8 of Castillo (2012b) for the Cox proportional hazards model and is omitted here.

Lemma A.6. Let $\Lambda_1$ and $\Lambda_2$ be cumulative hazard functions associated with baseline hazard functions $r_1$ and $r_2$. Let $h$ be the Hellinger metric. If $\|r_1 - r_2\|_{\infty} + |\theta_1 - \theta_2| \lesssim 1$, then

\[
\begin{aligned}
h^2(p_{\theta_1,\Lambda_1}, p_{\theta_2,\Lambda_2}) &\lesssim \|r_1 - r_2\|_{\infty}^2 + |\theta_1 - \theta_2|^2, \\
K(p_{\theta_1,\Lambda_1}, p_{\theta_2,\Lambda_2}) &\lesssim \|r_1 - r_2\|_{\infty}^2 + |\theta_1 - \theta_2|^2, \\
V_{k,0}(p_{\theta_1,\Lambda_1}, p_{\theta_2,\Lambda_2}) &\lesssim \|r_1 - r_2\|_{\infty}^k + |\theta_1 - \theta_2|^k.
\end{aligned}
\]

Let $\rho_n = n^{-1/3}$. According to this lemma, we have $\{(\theta,r) : |\theta-\theta_0| \le \rho_n,\ \|r - r_0\|_{\infty} \le \rho_n\} \subset B_n(P^{(n)}_0, \rho_n; k)$. For the Riemann-Liouville prior (5.7), it can be shown (Castillo, 2012b, Lemma 16) that for any Lipschitz continuous function $r_0$ and $\epsilon > 0$,

\[
\Pi(\|r - r_0\|_{\infty} \le \epsilon) \ge \exp(-\epsilon^{-1}).
\]

Let $C_1$ be the unit ball of $C[0,\tau]$ and $\mathbb H_1$ be the unit ball of the RKHS associated with the Gaussian process given in (5.7). Define $\mathcal F_n^{r} = \rho_n C_1 + \sqrt{10Cn}\,\rho_n\mathbb H_1$ for some large enough $C$. Then according to Theorem 2.1 of van der Vaart and van Zanten (2008a),

\[
\log N(3\rho_n, \mathcal F_n^{r}, \|\cdot\|_{\infty}) \le 6Cn\rho_n^2, \qquad \Pi(r \notin \mathcal F_n^{r}) \le \exp(-Cn\rho_n^2).
\]

Now we choose the sieve $\mathcal F_n$ as $\mathcal F_n^{\theta} \oplus \mathcal F_n^{r}$ with the above $\mathcal F_n^{r}$ and $\mathcal F_n^{\theta} = [-C\sqrt{n},\, C\sqrt{n}]^p$. Then the three conditions in Lemma 3.4 are satisfied. By Lemma 25.85 of van der Vaart (1998), we have $d_n^2((\theta,\Lambda),(\theta_0,\Lambda_0)) \gtrsim \|\Lambda-\Lambda_0\|_n^2 + |\theta-\theta_0|^2$. Therefore, Lemma 3.4 implies that $\Pi(\|\Lambda-\Lambda_0\|_n \ge \rho_n,\ |\theta-\theta_0| \ge \rho_n \mid X_1,\ldots,X_n) = O_{P_0}(e^{-C_1 n\rho_n^2})$.

In the rest of the proof, we set $\mathcal H_n = \{\Lambda : \Lambda \text{ is nondecreasing},\ \|\Lambda-\Lambda_0\|_n \le \rho_n,\ \|\Lambda\|_{\infty} \le C\}$ for some large enough $C$. Then by the above statements, we have $\Pi(\mathcal H_n \mid X_1,\ldots,X_n) = 1 - O_{P_0}(e^{-C_1 n\rho_n^2})$.

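Verification of (A1): By applying Taylor's expansions on $\theta$ and the central limit theorem, the log likelihood difference in (A1) can be written as

\[
\begin{aligned}
\ell_n(\theta, \Lambda + \Delta\Lambda(\theta)) - \ell_n(\theta_0,\Lambda) &= (\theta-\theta_0)^T\sum_{i=1}^{n} (Z_i\Lambda(C_i) + h^*(C_i))Q_{\theta_0,\Lambda}(X_i) \\
&\qquad - \frac{n}{2}(\theta-\theta_0)^T\tilde I_{\theta_0,\Lambda_0}(\theta-\theta_0) + O_{P_0}\big(n|\theta-\theta_0|^3 + n|\theta-\theta_0|^2\rho_n\big),
\end{aligned}
\tag{A.42}
\]

where $\tilde I_{\theta_0,\Lambda_0} = P\{Q^2_{\theta_0,\Lambda_0}(X)(Z\Lambda_0(C) + h^*(C))(Z\Lambda_0(C) + h^*(C))^T\}$ is the efficient information matrix.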

Now we focus on the first term on the RHS of (A.42). Let $g_{\Lambda}(x) = (z\Lambda(c) + h^*(c))Q_{\theta_0,\Lambda}(x) - E_0[(Z\Lambda(C) + h^*(C))Q_{\theta_0,\Lambda}(X) \mid C = c]$. Then $E_0(g_{\Lambda}(X)|C) = 0$ for any $\Lambda \in \mathcal H_n$ and $g_{\Lambda}(x)$ is a continuous function of $\Lambda(c)$ with a bounded Lipschitz constant. Therefore, the $\epsilon$-covering entropy of $\{g_{\Lambda} : \Lambda \in \mathcal H_n\}$ with respect to the conditional $L_2$-norm conditioning on $\{C_i\}$ is of the same order as $\log N(\epsilon, \mathcal H_n, \|\cdot\|_n)$, which is of order $\epsilon^{-1}$ since the functions in $\mathcal H_n$ are bounded and nondecreasing (van der Vaart, 1998, Example 19.11). By applying Lemma A.2 conditioning on the $C_i$'s, we get

\[
\begin{aligned}
&E_0\Big\{\sup_{\Lambda\in\mathcal H_n} \big|\mathbb G_n[(Z\Lambda(C) + h^*(C))Q_{\theta_0,\Lambda}(X)] - \mathbb G_n[(Z\Lambda_0(C) + h^*(C))Q_{\theta_0,\Lambda_0}(X)]\big| \,\Big|\, C_1,\ldots,C_n\Big\} \\
&\qquad\lesssim \int_0^{\rho_n} \sqrt{1 + \log N(\epsilon, \mathcal H_n, \|\cdot\|_n)}\, d\epsilon \lesssim \sqrt{\rho_n} \asymp \sqrt{n}\rho_n^2,
\end{aligned}
\]

where the last step follows since $\rho_n = n^{-1/3}$.

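As a result of the preceding display, we have

\[
\begin{aligned}
&\sum_{i=1}^{n} (Z_i\Lambda(C_i) + h^*(C_i))Q_{\theta_0,\Lambda}(X_i) - \sum_{i=1}^{n} (Z_i\Lambda_0(C_i) + h^*(C_i))Q_{\theta_0,\Lambda_0}(X_i) \\
&\qquad= O_{P_0}(n\rho_n^2) + \sum_{i=1}^{n} E_0\big[(Z_i\Lambda(C_i) + h^*(C_i))Q_{\theta_0,\Lambda}(X_i) - (Z_i\Lambda_0(C_i) + h^*(C_i))Q_{\theta_0,\Lambda_0}(X_i) \mid C_i\big].
\end{aligned}
\tag{A.43}
\]

Note that $\tilde\ell_{\theta,\Lambda}(X) = (Z\Lambda(C) + h^*_{\Lambda}(C))Q_{\theta,\Lambda}(X)$ is the efficient score function for the Cox model with current status data (van der Vaart, 1998, Section 25.11.1), where $h^*_{\Lambda}$ generalizes $h^*$ by substituting $\Lambda_0$ with $\Lambda$ in (5.6). By equation (25.60) on p. 396 of van der Vaart (1998), $P_{\theta_0,\Lambda_0}\tilde\ell_{\theta_0,\Lambda} = O_{P_0}(\|\Lambda-\Lambda_0\|_n^2)$. Therefore, for $\Lambda \in \mathcal H_n$,

\[
\begin{aligned}
&\sum_{i=1}^{n} E_0\big[(Z_i\Lambda(C_i) + h^*(C_i))Q_{\theta_0,\Lambda}(X_i) - (Z_i\Lambda_0(C_i) + h^*(C_i))Q_{\theta_0,\Lambda_0}(X_i) \mid C_i\big] \\
&\qquad= O_{P_0}(n\rho_n^2) + \sum_{i=1}^{n} \big(h^*(C_i) - h^*_{\Lambda}(C_i)\big)E_0\big(Q_{\theta_0,\Lambda}(X_i) - Q_{\theta_0,\Lambda_0}(X_i) \mid C_i\big) \\
&\qquad= O_{P_0}(n\rho_n^2) + O_{P_0}(n\|\Lambda-\Lambda_0\|_n^2) = O_{P_0}(n\rho_n^2),
\end{aligned}
\tag{A.44}
\]

since $E_0(Q_{\theta_0,\Lambda_0}(X_i)|C_i) = 0$ and $Q_{\theta_0,\Lambda}(x)$ is a continuous function of $\Lambda(c)$ with a bounded Lipschitz constant.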

Combining (A.42), (A.43) and (A.44), we obtain

\[
\begin{aligned}
\ell_n(\theta, \Lambda + \Delta\Lambda(\theta)) - \ell_n(\theta_0,\Lambda) &= (\theta-\theta_0)^T\sum_{i=1}^{n} \tilde\ell_{\theta_0,\Lambda_0}(X_i) - \frac{n}{2}(\theta-\theta_0)^T\tilde I_{\theta_0,\Lambda_0}(\theta-\theta_0) \\
&\qquad + O_{P_0}\big(n|\theta-\theta_0|^3 + n|\theta-\theta_0|^2\rho_n + \sqrt{n}\rho_n + n|\theta-\theta_0|\rho_n^2\big).
\end{aligned}
\]

The verification of (A4) is similar to that in the proof of Theorem 5.1 since, by the assumption of the theorem, $h^*$ is Lipschitz continuous. The verification of (A6) is a simplified version of the verification of (A1) and is omitted. Finally, Theorem 5.4 can be proved by applying Theorem 4.3.